How to get href with BeautifulSoup - python

The Situation
I want to scrape from this website:
http://www.dpm.tn/dpm_pharm/medicament/listmedicparnomspec.php
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
# agent
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
# headless driver
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument(f'user-agent={user_agent}')
options.add_argument("--window-size=1920,1080")
options.add_argument('--ignore-certificate-errors')
options.add_argument('--allow-running-insecure-content')
options.add_argument("--disable-extensions")
options.add_argument("--proxy-server='direct://'")
options.add_argument("--proxy-bypass-list=*")
options.add_argument("--start-maximized")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(executable_path=r"D:\Downloads\chromedriver.exe", options=options)
# request test
medecine = 'doliprane'
# submitting a search
driver.get('http://www.dpm.tn/dpm_pharm/medicament/listmedicparnomspec.php')
e = driver.find_element_by_name('id')
e.send_keys(medecine)
e.submit()
# getting the result table
try:
    table = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody')
    print('success')
except:
    print('failed')
The code to get the link:
print('bs4 turn \n')
result = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
rows = result.find_all('tr')
links = []
real_link = []
for row in rows:
    links.append(row.find('a', href=True))
for each in links:
    print(each['href'])
The Problem:
Whenever I run this, I always get this error:
'NoneType' object is not subscriptable
The question:
How can I get this to work and extract the href attribute as required?

Instead of using Selenium, use the requests library to fetch the data and parse it.
Code:
import re
import requests
from bs4 import BeautifulSoup
medecine = 'doliprane'
url = "http://www.dpm.tn/dpm_pharm/medicament/listmedicspec.php"
payload = {"id":medecine}
response = requests.post(url, data=payload)
parsedhtml = BeautifulSoup(response.content, "html.parser")
regex = re.compile('fiche.php.*')
atag = parsedhtml.find_all("a", {"href": regex})
links = [i['href'].replace("fiche.php", "http://www.dpm.tn/dpm_pharm/medicament/fiche.php") for i in atag]
print(links)
Let me know if you have any questions :)

When accessing it, try this:
print('bs4 turn \n')
result = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
rows = result.find_all('tr')
links = []
real_link = []
for row in rows:
    a = row.find("a", href=True)
    if a is not None:  # some rows contain no link; skip them to avoid the 'NoneType' error
        links.append(a['href'])
for each in links:
    print(each)

I solved it, but using Selenium instead of Beautiful Soup:
for i in range(2, max):
    a_driver = driver.find_element_by_xpath(f'/html/body/table/tbody/tr/td/table/tbody/tr[{i}]/td[11]/a')
    result2 = BeautifulSoup(a_driver.get_attribute('innerHTML'), 'lxml')
    link = a_driver.get_attribute('href')
    links.append(link)
for i in range(0, len(links)):
    print(links[i])
This worked for me.

Related

Why does my code using Selenium have such a long iteration time in a for-loop in Python? (Chromedriver)

I'm a beginner in Python, so please be patient with me.
I want to extract some simple data from an array of URLs.
All the URLs' HTML contents have the same structure, so extracting the data using a for-loop works out fine.
I use Selenium because I found out that the website's JavaScript changes the initial HTML code, and I want to work on the final HTML code.
Every iteration takes around 4 seconds, which adds up to a lot of time.
I already found out that wait.until(page_has_loaded) alone takes half of the code's runtime.
import win32com.client as win32
import requests
import openpyxl
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
driver_path = 'C:\\webdrivers\\chromedriver.exe'
dir = "C:\\Users\\Me\\OneDrive\\Dokumente_\\Notizen\\CSGOItems\\CSGOItems.xlsx"
workbook = openpyxl.load_workbook(dir)
sheet1 = workbook["Tabelle1"]
sheet2 = workbook["AllPrices"]
URLSkinBit = [
'https://skinbid.com/auctions?mh=Gamma%20Case&sellType=fixed_price&skip=0&take=30&sort=price%23asc&ref=csgoskins',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=gamma&sellType=all',
'https://skinbid.com/auctions?mh=Danger%20Zone%20Case&sellType=fixed_price&skip=0&take=30&sort=price%23asc&ref=csgoskins',
'https://skinbid.com/auctions?mh=Dreams%20%26%20Nightmares%20Case&sellType=fixed_price&skip=0&take=30&sort=price%23asc&ref=csgoskins',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=vanguard&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=chroma%203&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=spectrum%202&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=clutch&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=snakebite&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=falchion&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=fracture&sellType=all',
'https://skinbid.com/listings?popular=false&goodDeals=false&sort=price%23asc&take=10&skip=0&search=prisma%202&sellType=all',
]
def page_has_loaded(driver):
    return driver.execute_script("return document.readyState") == "complete"
def SkinBitPrices():
    global count3
    count3 = 0
    with webdriver.Chrome(executable_path=driver_path) as driver:
        for url in URLSkinBit:
            driver.get(url)
            wait = WebDriverWait(driver, 10)
            wait.until(page_has_loaded)
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')
            container = soup.find('div', {'class': 'price'}).text
            price = float(container.replace(' € ', ''))
            print("%.2f" % price)
            # Edit Excel file
            cell = str(3 + count3)
            sheet2['B' + cell] = price
            count3 += 1
        driver.quit()
    workbook.save(dir)
    workbook.close()
    return
SkinBitPrices()
Do you see any possibilities to improve the performance here?
Thanks a lot.
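One general note, offered as a hedged sketch rather than something from the original post: with Chrome's default page-load strategy, driver.get() already blocks until document.readyState is "complete", so the extra wait.until(page_has_loaded) is usually redundant. Selenium 4's 'eager' page-load strategy hands control back as soon as the DOM is parsed, which can shave time off each iteration:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.page_load_strategy = 'eager'  # return after DOMContentLoaded instead of the full load (Selenium 4)
driver = webdriver.Chrome(options=options)  # pass a Service(...) with your chromedriver path if it is not on PATH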

Page source not keeping up with updates made to the page by Selenium

I'm using the following code to scrape a web page:
import scrapy
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException
class JornaleconomicoSpider(scrapy.Spider):
    name = 'jornaleconomico'
    allowed_domains = ['jornaleconomico.pt']
    start_urls = ['https://jornaleconomico.pt/categoria/economia']
    def parse(self, response):
        options = Options()
        driver_path = '###'  # Your Chrome Webdriver Path
        browser_path = '###'  # Your Google Chrome Path
        options.binary_location = browser_path
        options.add_experimental_option("detach", True)
        self.driver = webdriver.Chrome(options=options, executable_path=driver_path)
        self.driver.get(response.url)
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
        wait = WebDriverWait(self.driver, 120, ignored_exceptions=ignored_exceptions)
        self.new_src = None
        self.new_response = None
        i = 0
        while i < 10:
            # click next link
            try:
                element = wait.until(EC.element_to_be_clickable((By.XPATH, '*//div[@class="je-btn je-btn-more"]')))
                self.driver.execute_script("arguments[0].click();", element)
                self.new_src = self.driver.page_source
                self.new_response = response.replace(body=self.new_src)
                i += 1
            except TimeoutException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break
        # grab the data
        headlines = self.new_response.xpath('*//h1[@class="je-post-title"]/a/text()').extract()
        for headline in headlines:
            yield {
                'text': headline
            }
The code above is supposed to click 10 times on Ver mais artigos (See More Articles) and get the text from all the headlines, but it's getting only the original first nine headlines. I checked the page source in the Chrome window opened by Selenium (using the options.add_experimental_option("detach", True) line to keep the Selenium window open), and I found that the page source is the same as the original page, before the clicks. To me this shouldn't be happening, since in that same Selenium window I can correctly inspect all the articles, not just the first nine, and even using WebDriverWait is not preventing it. How can I solve this?
Here is the (almost) complete solution:
from json import loads, dumps
from requests import get, post
from lxml.html import fromstring
from re import search, sub, findall
headerz = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-language": "en-US,en;q=0.9",
    "sec-ch-ua": "'Chromium';v='106', 'Google Chrome';v='106', 'Not;A=Brand';v='99'",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "'Linux'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-site",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://jornaleconomico.pt/categoria/economia"
pag_href = "https://jornaleconomico.pt/wp-admin/admin-ajax.php"
page_count = 0
r = get(url)
html = fromstring(r.content.decode())
rawnonce = html.xpath("//script[@id='je-main-js-extra']/text()")
# print first 9 records
for p in html.xpath("//div[contains(@class,'je-posts-container')]//h1[contains(@class,'je-post-title')]/a"):
    ptitle = p.xpath("./text()")
    if isinstance(ptitle, list):
        post_title = ptitle[0]
        post_href = p.xpath("./@href")[0]
        print(post_href)
# pagination
while True:
    page_count += 9
    pag_params = {
        "action": "je_pagination",
        "nonce": "",
        "je_offset": page_count,
        "je_term": "economia"
    }
    r = post(pag_href, headers=headerz, data=pag_params)
    jdata = r.json()
    if (jdata and 'data' in jdata):
        jdata = jdata['data']['posts']
        html = fromstring(jdata)
        for p in html.xpath("//h1[contains(@class,'je-post-title')]/a"):
            ptitle = p.xpath("./text()")
            if isinstance(ptitle, list):
                post_title = ptitle[0]
                post_href = p.xpath("./@href")[0]
                print(post_href)
    else:
        break
The output looks like:
https://jornaleconomico.pt/noticias/ministro-das-financas-diz-que-o-governo-esta-a-acompanhar-de-forma-atenta-inflacao-dos-produtos-alimentares-989146
https://jornaleconomico.pt/noticias/bancos-amortizam-antecipadamente-pagamento-dos-ltro-ao-bce-no-valor-de-16-mil-milhoes-989098
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-terca-feira-51-988633
https://jornaleconomico.pt/noticias/prestacao-da-casa-sobe-quase-200-euros-para-creditos-de-150-mil-euros-a-6-meses-989134
https://jornaleconomico.pt/noticias/portugal-2020-atinge-85-de-execucao-e-116-de-compromisso-ate-dezembro-989132
https://jornaleconomico.pt/noticias/crescimento-do-pib-de-67-da-mais-confianca-para-desempenho-de-2023-diz-fernando-medina-989124
https://jornaleconomico.pt/noticias/apesar-dos-reforcos-salario-minimo-portugues-continua-a-meio-da-tabela-na-europa-988979
https://jornaleconomico.pt/noticias/atividade-turistica-dormidas-aumentaram-863-face-a-2021-988958
https://jornaleconomico.pt/noticias/producao-industrial-cresceu-25-em-dezembro-988947
https://jornaleconomico.pt/noticias/pib-cresce-36-na-ue-e-35-na-zona-euro-988921
https://jornaleconomico.pt/noticias/fundo-soberano-da-noruega-regista-maiores-perdas-desde-2008-988905
https://jornaleconomico.pt/noticias/economia-do-reino-unido-e-a-unica-do-g7-com-perspectivas-de-crescimento-negativo-988900
https://jornaleconomico.pt/noticias/economia-portuguesa-cresceu-67-em-2022-988868
https://jornaleconomico.pt/noticias/revista-de-imprensa-nacional-as-noticias-que-estao-a-marcar-esta-terca-feira-48-988814
https://jornaleconomico.pt/noticias/economia-chinesa-com-fortes-perspetivas-de-crescimento-988855
https://jornaleconomico.pt/noticias/fmi-reve-em-alta-as-previsoes-globais-de-crescimento-global-para-2023-e-agradece-a-china-988823
https://jornaleconomico.pt/noticias/alemanha-vendas-a-retalho-registam-a-maior-queda-desde-abril-de-2021-988817
https://jornaleconomico.pt/noticias/je-bom-dia-ine-divulga-dados-sobre-a-inflacao-e-a-economia-988416
https://jornaleconomico.pt/noticias/economia-francesa-cresce-26-em-2022-988767
https://jornaleconomico.pt/noticias/topo-da-agenda-o-que-nao-pode-perder-nos-mercados-e-na-economia-esta-terca-feira-31-988687
https://jornaleconomico.pt/noticias/auditoria-da-igf-ao-sifide-deteta-319-milhoes-de-euros-em-credito-fiscal-indevido-988699
https://jornaleconomico.pt/noticias/ministerio-das-infraestruturas-esta-a-acompanhar-subida-de-precos-das-operadoras-988697
https://jornaleconomico.pt/noticias/economistas-preveem-crescimento-do-pib-entre-66-e-68-em-2022-988681
https://jornaleconomico.pt/noticias/queda-do-pib-em-cadeia-na-alemanha-faz-soar-alarmes-de-recessao-na-zona-euro-de-novo-988583
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-segunda-feira-49-988067
https://jornaleconomico.pt/noticias/jmj-investimentos-da-igreja-do-governo-e-dos-municipios-somam-pelo-menos-155-milhoes-de-euros-988637
https://jornaleconomico.pt/noticias/da-energia-europeia-a-economia-chinesa-veja-as-escolhas-da-semana-no-mercados-em-acao-988544
https://jornaleconomico.pt/noticias/riscos-de-uma-nova-moeda-comum-para-brasil-e-argentina-ouca-o-podcast-atlantic-connection-988395
https://jornaleconomico.pt/noticias/sindicatos-reunem-se-hoje-com-governo-para-tentar-evitar-greve-na-cp-e-ip-988622
https://jornaleconomico.pt/noticias/fundo-europeu-para-os-media-e-informacao-abre-novos-concursos-988564
https://jornaleconomico.pt/noticias/pt2020-portugal-entre-paises-que-mais-executam-fundos-europeus-988590
https://jornaleconomico.pt/noticias/maiores-bancos-espanhois-preparam-se-para-contestar-taxa-sobre-lucros-caidos-do-ceu-988545
You don't actually need to use Selenium for this very-easy-to-fetch website. Here is what I would do if I needed data from there.
Testing with Postman
POST https://domain.pt/wp-admin/admin-ajax.php
content-type: application/x-www-form-urlencoded; charset=UTF-8
pragma: no-cache
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36
x-requested-with: XMLHttpRequest
action=je_pagination&nonce=f2e925cd72&je_offset=9&je_term=economia
The first 9 records of the blog are printed with their links, and pagination can be done with the Postman sample above: just change 'je_offset' to 9, 18, 27, etc. and update the 'nonce'.
Every time you load the page, you need to get a new 'nonce' from the HTML.
This is what the website includes on every page; try using re.search to get the 'ajax_nonce' value.
<script type='text/javascript' id='je-main-js-extra'>
/* <![CDATA[ */
var ajax_object = {"ajax_url":"https:\/\/domain.pt\/wp-admin\/admin-ajax.php","ajax_nonce":"f2e925cd72"};
/* ]]> */
</script>
Try loading the page using requests.get and paginating using requests.post - this should make your job much easier and much faster than Selenium.
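A minimal sketch of that nonce extraction, assuming the nonce is a hex string as in the snippet above:
import re
import requests
page = requests.get("https://jornaleconomico.pt/categoria/economia")
match = re.search(r'"ajax_nonce":"([0-9a-f]+)"', page.text)
nonce = match.group(1) if match else ""
# pass this value as pag_params["nonce"] before POSTing to admin-ajax.php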

Selenium doesn't return all elements required

I'm trying to get a bunch of links to the houses from this website, but it only returns about 9 elements even though the page has more. I also tried using Beautiful Soup, but the same thing happens and it doesn't return all the elements.
With Selenium:
for i in range(10):
    time.sleep(1)
    scr1 = driver.find_element_by_xpath('//*[@id="search-page-list-container"]')
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", scr1)
link_tags = driver.find_elements_by_css_selector(".list-card-info a")
links = [link.get_attribute("href") for link in link_tags]
pprint(links)
With bs4:
headers = {
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  'Chrome/74.0.3729.131 Safari/537.36'
}
response = requests.get(ZILLOW_URL, headers=headers)
website_content = response.text
soup = BeautifulSoup(website_content, "html.parser")
link_tags = soup.select(".list-card-info a")
link_list = [link.get("href") for link in link_tags]
pprint(link_list)
Output:
'https://www.zillow.com/b/407-fairmount-ave-oakland-ca-9NTzMK/',
'https://www.zillow.com/homedetails/1940-Buchanan-St-A-San-Francisco-CA-94115/15075413_zpid/',
'https://www.zillow.com/homedetails/2380-California-St-QZ6SFATJK-San-Francisco-CA-94115/2078197750_zpid/',
'https://www.zillow.com/homedetails/5687-Miles-Ave-Oakland-CA-94618/299065263_zpid/',
'https://www.zillow.com/b/olume-san-francisco-ca-65f3Yr/',
'https://www.zillow.com/homedetails/29-Balboa-St-APT-1-San-Francisco-CA-94118/2092859824_zpid/']
Is there any way to tackle this problem? I would really appreciate the help.
You have to scroll to each element one by one in a loop and then look for the descendant anchor tag which has the href.
driver.maximize_window()
#driver.implicitly_wait(30)
wait = WebDriverWait(driver, 50)
driver.get("https://www.zillow.com/homes/for_rent/1-_beds/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-123.7956336875%2C%22east%22%3A-121.6368202109375%2C%22south%22%3A37.02044483468766%2C%22north%22%3A38.36482775108166%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A9%7D")
j = 1
for i in range(len(driver.find_elements(By.XPATH, "//article"))):
    all_items = driver.find_element_by_xpath(f"(//article)[{j}]")
    driver.execute_script("arguments[0].scrollIntoView(true);", all_items)
    print(all_items.find_element_by_xpath('.//descendant::a').get_attribute('href'))
    j = j + 1
    time.sleep(2)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
https://www.zillow.com/b/bay-village-vallejo-ca-5XkKWj/
https://www.zillow.com/b/waterbend-apartments-san-francisco-ca-9NLgqG/
https://www.zillow.com/b/the-verdant-apartments-san-jose-ca-5XsGhW/
https://www.zillow.com/homedetails/1539-Lincoln-Ave-San-Rafael-CA-94901/80743209_zpid/
https://www.zillow.com/b/the-crossing-at-arroyo-trail-livermore-ca-5XjR44/
https://www.zillow.com/homedetails/713-Trancas-St-APT-4-Napa-CA-94558/2081608744_zpid/
https://www.zillow.com/b/americana-apartments-mountain-view-ca-5hGhMy/
https://www.zillow.com/b/jackson-arms-apartments-hayward-ca-5XxZLv/
https://www.zillow.com/b/elan-at-river-oaks-san-jose-ca-5XjLQF/
https://www.zillow.com/homedetails/San-Francisco-CA-94108/2078592726_zpid/
https://www.zillow.com/homedetails/20914-Cato-Ct-Castro-Valley-CA-94546/2068418792_zpid/
https://www.zillow.com/homedetails/1240-21st-Ave-3-San-Francisco-CA-94122/2068418798_zpid/
https://www.zillow.com/homedetails/1246-Walker-Ave-APT-207-Walnut-Creek-CA-94596/18413629_zpid/
https://www.zillow.com/b/the-presidio-fremont-ca-5Xk3QQ/
https://www.zillow.com/homedetails/1358-Noe-St-1-San-Francisco-CA-94131/2068418857_zpid/
https://www.zillow.com/b/the-estates-at-park-place-fremont-ca-5XjVpg/
https://www.zillow.com/homedetails/2060-Camel-Ln-Walnut-Creek-CA-94596/2093645611_zpid/
https://www.zillow.com/b/840-van-ness-san-francisco-ca-5YCwMj/
https://www.zillow.com/homedetails/285-Grand-View-Ave-APT-6-San-Francisco-CA-94114/2095256302_zpid/
https://www.zillow.com/homedetails/929-Oak-St-APT-3-San-Francisco-CA-94117/2104800238_zpid/
https://www.zillow.com/homedetails/420-N-Civic-Dr-APT-303-Walnut-Creek-CA-94596/18410162_zpid/
https://www.zillow.com/homedetails/1571-Begen-Ave-Mountain-View-CA-94040/19533010_zpid/
https://www.zillow.com/homedetails/145-Woodbury-Cir-D-Vacaville-CA-95687/2068419093_zpid/
https://www.zillow.com/b/trinity-towers-apartments-san-francisco-ca-5XjPdR/
https://www.zillow.com/b/hidden-creek-vacaville-ca-5XjV3h/
https://www.zillow.com/homedetails/19-Belle-Ave-APT-7-San-Anselmo-CA-94960/2081212106_zpid/
https://www.zillow.com/homedetails/1560-Jackson-St-APT-11-Oakland-CA-94612/2068419279_zpid/
https://www.zillow.com/homedetails/1465-Marchbanks-Dr-APT-2-Walnut-Creek-CA-94598/18382713_zpid/
https://www.zillow.com/homedetails/205-Morning-Sun-Ave-B-Mill-Valley-CA-94941/2077904048_zpid/
https://www.zillow.com/homedetails/1615-Pacific-Ave-B-Alameda-CA-94501/2073535331_zpid/
https://www.zillow.com/homedetails/409-S-5th-St-1-San-Jose-CA-95112/2078856409_zpid/
https://www.zillow.com/homedetails/5635-Anza-St-P5G3CZYNW-San-Francisco-CA-94121/2068419581_zpid/
https://www.zillow.com/b/407-fairmount-ave-oakland-ca-9NTzMK/
https://www.zillow.com/homedetails/1940-Buchanan-St-A-San-Francisco-CA-94115/15075413_zpid/
https://www.zillow.com/homedetails/2380-California-St-QZ6SFATJK-San-Francisco-CA-94115/2078197750_zpid/
https://www.zillow.com/homedetails/1883-Agnew-Rd-UNIT-241-Santa-Clara-CA-95054/79841436_zpid/
https://www.zillow.com/b/marina-playa-santa-clara-ca-5XjKBc/
https://www.zillow.com/b/birch-creek-mountain-view-ca-5XjKKB/
https://www.zillow.com/homedetails/969-Clark-Ave-D-Mountain-View-CA-94040/2068419946_zpid/
https://www.zillow.com/homedetails/74-Williams-St-San-Leandro-CA-94577/24879175_zpid/
The problem is the website. It adds the links dynamically, so you can try scrolling to the bottom of the page and then searching for the links.
bottomFooter = driver.find_element_by_id("region-info-footer")
driver.execute_script("arguments[0].scrollIntoView();", bottomFooter)

How to scrape data on a subpage of a website?

Here's the website: the Booking.com page whose URL appears in the script below.
And here's my script:
from selenium import webdriver
import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium.webdriver.common.keys import Keys
# path to the folder where you placed your chromedriver
PATH = r"driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
#driver = webdriver.Chrome(options=options, executable_path=PATH)
url = 'https://www.booking.com/hotel/fr/d-argentine.fr.html?label=gen173nr-1DEgdyZXZpZXdzKIICOOgHSDNYBGhNiAEBmAENuAEXyAEM2AED6AEBiAIBqAIDuAKr2vuGBsACAdICJDE1YjBlZDY1LTI2NzEtNGM3Mi04OWQ1LWE5MjQ3OWFmNzE2NtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city;dist=0;group_adults=2;group_children=0;hapos=1;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;srepoch=1625222475;srpvid=48244b2523010057;type=total;ucfs=1&#tab-main'
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.get(url)
driver.maximize_window()
time.sleep(2)
headers= {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
try:
    cookie.click()
except:
    pass
time.sleep(2)
country = driver.find_element_by_xpath('//*[@class="hp_nav_reviews_link toggle_review track_review_link_zh"]')
country.click()
time.sleep(2)
url2 = driver.current_url
commspos = []
commsneg = []
header = []
notes = []
dates = []
datestostay = []
results = requests.get(url2, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
reviews = soup.find_all('li', class_ = "review_item clearfix")
for review in reviews:
    try:
        commpos = review.find("p", class_="review_pos").text.strip()
    except:
        commpos = 'NA'
    commspos.append(commpos)
    try:
        commneg = review.find("p", class_="review_neg").text.strip()
    except:
        commneg = 'NA'
    commsneg.append(commneg)
    head = review.find('div', class_='review_item_header_content').text.strip()
    header.append(head)
    note = review.find('span', class_='review-score-badge').text.strip()
    notes.append(note)
    date = review.find('p', class_='review_item_date').text[23:].strip()
    dates.append(date)
    try:
        datestay = review.find('p', class_='review_staydate').text[20:].strip()
        datestostay.append(datestay)
    except:
        datestostay.append('NaN')
data = pd.DataFrame({
    'commspos': commspos,
    'commsneg': commsneg,
    'headers': header,
    'notes': notes,
    'dates': dates,
    'datestostay': datestostay,
})
data.to_csv('dftest.csv', sep=';', index=False, encoding = 'utf_8_sig')
My script navigates to the reviews subpage, but the CSV file I get as output is empty. I assume it has to do with some kind of JavaScript. I have already encountered this: the script goes to the subpage but isn't really "inside" it, so it doesn't have access to that part of the HTML and gives nothing in the output.
The sub-page that you want loads its data from this URL:
https://www.booking.com/hotelfeaturedreviews/fr/d-argentine.fr.html?label=gen173nr-1DEgdyZXZpZXdzKIICOOgHSDNYBGhNiAEBmAENuAEXyAEM2AED6AEBiAIBqAIDuAKr2vuGBsACAdICJDE1YjBlZDY1LTI2NzEtNGM3Mi04OWQ1LWE5MjQ3OWFmNzE2NtgCBOACAQ;sid=22417257c7da25395d270bcc7c6ec2e8;dest_id=-1456928;dest_type=city;group_adults=2;group_children=0;hapos=1;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;srepoch=1625222475;srpvid=48244b2523010057;type=total;ucfs=1&&_=1625476042625
You can easily scrape this page and extract the data you need.
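For illustration, here is a minimal sketch of that approach, reusing the classes from the question's parsing code (the sid, srpvid and timestamp parameters in the URL are session-specific, so they may need refreshing):
import requests
from bs4 import BeautifulSoup
ajax_url = 'https://www.booking.com/hotelfeaturedreviews/fr/d-argentine.fr.html?label=gen173nr-1DEgdyZXZpZXdzKIICOOgHSDNYBGhNiAEBmAENuAEXyAEM2AED6AEBiAIBqAIDuAKr2vuGBsACAdICJDE1YjBlZDY1LTI2NzEtNGM3Mi04OWQ1LWE5MjQ3OWFmNzE2NtgCBOACAQ;sid=22417257c7da25395d270bcc7c6ec2e8;dest_id=-1456928;dest_type=city;group_adults=2;group_children=0;hapos=1;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;srepoch=1625222475;srpvid=48244b2523010057;type=total;ucfs=1&&_=1625476042625'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
soup = BeautifulSoup(requests.get(ajax_url, headers=headers).text, 'html.parser')
for review in soup.find_all('li', class_='review_item clearfix'):
    pos = review.find('p', class_='review_pos')
    neg = review.find('p', class_='review_neg')
    # fall back to 'NA' when a review has only one side, as in the original script
    print(pos.text.strip() if pos else 'NA', '|', neg.text.strip() if neg else 'NA')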

Retrieve all car links from a dynamic page

from selenium import webdriver
from bs4 import BeautifulSoup  # used to parse the page source below
import time  # used by the scrolldown() helper
options = webdriver.ChromeOptions()
options.add_argument("--user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'")
#options.add_argument("headless")
driver=webdriver.Chrome(executable_path="/home/timmy/Python/chromedriver",chrome_options=options)
url="https://turo.com/search?country=US&defaultZoomLevel=7&endDate=03%2F20%2F2019&endTime=10%3A00&international=true&isMapSearch=false&itemsPerPage=200&location=Colorado%2C%20USA&locationType=City&maximumDistanceInMiles=30&northEastLatitude=41.0034439&northEastLongitude=-102.040878&region=CO&sortType=RELEVANCE&southWestLatitude=36.992424&southWestLongitude=-109.060256&startDate=03%2F15%2F2019&startTime=10%3A00"
driver.get(url)
list_of_all_car_links=[]
x=0
while True:
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    for i in soup.find_all("a", href=True):
        if i['href'].startswith("/rentals") and len(i['href']) > 31:
            link2 = "https://turo.com" + i['href']
            list_of_all_car_links.append(link2)
    try:
        x = scrolldown(last_height=x)
    except KeyError:
        # driver.close()
        break
I tried scrolling down and then finding the links, but I only got part of them. Here is my scroll-down function:
def scrolldown(last_height=0, SCROLL_PAUSE_TIME=3, num_tries=2):
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    # break condition
    if last_height == new_height:
        # print("hello")
        num_tries -= 1
        if num_tries == 0:
            print("Reached End of page")
            raise KeyError
        else:
            scrolldown(last_height=last_height, SCROLL_PAUSE_TIME=2, num_tries=num_tries)
    return new_height
I also tried converting the HTML to BeautifulSoup after each scroll and then finding the links, but I didn't get all of them.
What I want is to get every car link on that page.
I would use requests and the API shown in the XHR list in dev tools. Note the items-per-page parameter in the query string, itemsPerPage=200. You can try altering this for larger result sets (see the sketch after the code below).
import requests
url = 'https://turo.com/api/search?country=US&defaultZoomLevel=7&endDate=03%2F20%2F2019&endTime=10%3A00&international=true&isMapSearch=false&itemsPerPage=200&location=Colorado%2C%20USA&locationType=City&maximumDistanceInMiles=30&northEastLatitude=41.0034439&northEastLongitude=-102.040878&region=CO&sortType=RELEVANCE&southWestLatitude=36.992424&southWestLongitude=-109.060256&startDate=03%2F15%2F2019&startTime=10%3A00'
baseUrl = 'https://turo.com'
headers = {'Referer' : 'https://turo.com/search?country=US&defaultZoomLevel=7&endDate=03%2F20%2F2019&endTime=10%3A00&international=true&isMapSearch=false&itemsPerPage=200&location=Colorado%2C%20USA&locationType=City&maximumDistanceInMiles=30&northEastLatitude=41.0034439&northEastLongitude=-102.040878&region=CO&sortType=RELEVANCE&southWestLatitude=36.992424&southWestLongitude=-109.060256&startDate=03%2F15%2F2019&startTime=10%3A00',
'User-Agent' : 'Mozilla/5.0'}
r = requests.get(url, headers = headers).json()
results = []
for item in r['list']:
    results.append(baseUrl + item['vehicle']['url'])
print(results)
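If you want to try a larger result set as suggested, one way (a sketch, using a hypothetical helper named with_items_per_page and reusing the url and headers defined above) is to rewrite the query-string parameter before making the request; the API may cap the value:
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
def with_items_per_page(url, n):
    # rebuild the URL with a different itemsPerPage value
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    qs['itemsPerPage'] = [str(n)]
    return urlunparse(parts._replace(query=urlencode(qs, doseq=True)))
r = requests.get(with_items_per_page(url, 500), headers=headers).json()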
