How to get the scraped values on a website that has FingerprintJS? - Python

I am trying to scrape this site using Google Colab:
https://www.blibli.com/p/facial-tissue-tisu-wajah-250-s-paseo/is--LO1-70001-00049-00001?seller_id=LO1-70001&sku_id=LO1-70001-00049-00001&sclid=7zuGEaS4hh5SowAA6tnfd5i2wKjR6e3p&sid=c5746ccfbb298d3b&pid=LO1-70001-00049-00001
The idea is to capture the fingerprint so it can be parsed and reused in later requests.
Currently my code is:
from datetime import date
from random import randint
from time import sleep

import pandas as pd
from selenium.webdriver.common.by import By
from seleniumwire import webdriver

options = webdriver.ChromeOptions()
options.set_capability(
    "goog:loggingPrefs", {"performance": "ALL", "browser": "ALL"}
)
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# open it, go to a website, and get results
driver = webdriver.Chrome('chromedriver', options=options)

today = date.today()  # date_key for each row (was not defined in the original snippet)
dataranch = [1]
dataurls = ['https://www.blibli.com/p/facial-tissue-tisu-wajah-250-s-paseo/is--LO1-70001-00049-00001?seller_id=LO1-70001&sku_id=LO1-70001-00049-00001&sclid=7zuGEaS4hh5SowAA6tnfd5i2wKjR6e3p&sid=c5746ccfbb298d3b&pid=LO1-70001-00049-00001']

for prod_id, urls in zip(dataranch, dataurls):
    try:
        driver.get(urls)
        sleep(randint(3, 5))
        product_name = driver.find_element(By.CSS_SELECTOR, ".product-name").text
        try:
            normal_price = driver.find_element(By.CSS_SELECTOR, ".product-price__before").text
        except Exception:
            normal_price = "0"
        normal_price = normal_price.replace('Rp', "").replace(".", "")
        try:
            discount = driver.find_element(By.CSS_SELECTOR, ".product-price__discount").text
        except Exception:
            discount = "0"
        compid = urls.split(".")[4].split("?")[0]
        dat = {
            'product_name': product_name,
            'normal_price': normal_price,
            'discount': discount,
            'competitor_id': compid,
            'url': urls,
            'prod_id': prod_id,
            'date_key': today,
            'web': 'ranch market'
        }
        dat = pd.DataFrame([dat])
    except Exception as e:
        print(f"{urls} error")
        print(e)
What am I doing wrong here? Can someone help? I tried inspecting the elements and the CSS selector is there. Is there a way to scrape the data needed? Do I need to use a different module than plain Selenium to get the data?
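Two things may be worth checking before reaching for another module. A minimal sketch, assuming the .product-name selector from the question is correct: replace the fixed sleep with an explicit wait (the page may render later in headless mode than a 3-5 second nap allows), and step through the traffic selenium-wire already captures to look for the fingerprint call. Filtering on the substring "fingerprint" is an assumption; inspect the captured URLs for the real endpoint name.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(urls)

# Wait up to 20 s for the element instead of sleeping a fixed 3-5 s.
product_name = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-name"))
).text

# selenium-wire records every request the page made; stepping through
# driver.requests is one way to find the fingerprint endpoint and the
# headers it was sent with, so they can be replayed with requests.
for request in driver.requests:
    if request.response and "fingerprint" in request.url.lower():  # assumed substring
        print(request.url)
        print(request.headers)
If the wait times out, the element never rendered in headless Chrome, which points at bot detection rather than a wrong selector.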

Related

Web Scraping Python - Pubs

I am trying to extract the site name and address data from this website for each card, but this doesn't seem to work. Any suggestions?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://order.marstons.co.uk/")
all_cards = driver.find_elements_by_xpath("//div[@class='h3.body__heading']/div[1]")
for card in all_cards:
    print(card.text)  # do as you will
I'm glad that you are trying to help yourself; it seems you are new to this, so let me offer some help.
Automating a browser via Selenium for this is going to take you forever. The Marston's site is pretty straightforward to scrape if you know where to look: open your browser's developer tools (F12 on PC), go to the Network tab, filter by Fetch/XHR, and hit refresh while on the Marston's site, and you'll see some backend API calls happening. Click the one that says "brand", then the "Preview" tab, and you'll see a collapsible list of all sorts of information. That is a JSON response, essentially a collection of Python lists and dictionaries, which makes it easy to get the data you are after. The information in the "venues" list will be helpful when it comes to scraping the menus for each venue.
When you go to a specific pub you'll see an API call with the pub's name; this has all the menu info, which you can inspect the same way, and we can call these venue APIs using the "slug" data from the venues response above.
So by making our own requests to these URLs and stepping through the JSON we can have everything done in a couple of minutes, far easier than automating a browser! I've written the code below; feel free to ask questions if anything is unclear. You'll need to pip install requests and pandas to make this work. You owe me a pint! :) Cheers
import requests
import pandas as pd

headers = {'origin': 'https://order.marstons.co.uk'}
url = 'https://api-cdn.orderbee.co.uk/brand'
resp = requests.get(url, headers=headers).json()

venues = {}
for venue in resp['venues']:
    venues[venue['slug']] = venue

print(f'{len(venues)} venues to scrape')

output = []
for venue in venues.keys():
    try:
        url = f'https://api-cdn.orderbee.co.uk/venues/{venue}'
        print(f'Scraping: {venues[venue]["name"]}')
        try:
            info = requests.get(url, headers=headers).json()
        except Exception as e:
            print(e)
            print(f'{venues[venue]["name"]} not available')
            continue
        for category in info['menus']['oat']['categories']:  # oat = order at table?
            cat_name = category['name']
            for subcat in category['subCategories']:
                subcat_name = subcat['name']
                for item in subcat['items']:
                    item_info = {  # renamed from info to avoid shadowing the venue response
                        'venue_name': venues[venue]['name'],
                        'venue_city': venues[venue]['address']['city'],
                        'venue_address': venues[venue]['address']['streetAddress'],
                        'venue_postcode': venues[venue]['address']['postCode'],
                        'venue_latlng': venues[venue]['address']['location']['coordinates'],
                        'category': cat_name,
                        'subcat': subcat_name,
                        'item_name': item['name'],
                        'item_price': item['price'],
                        'item_id': item['id'],
                        'item_sku': item['sku'],
                        'item_in_stock': item['inStock'],
                        'item_active': item['isActive'],
                        'item_last_update': item['updatedAt'],
                        'item_diet': item['diet']
                    }
                    output.append(item_info)
    except Exception as e:
        print(f'Problem scraping {venues[venue]["name"]}, skipping it')  # when there is no menu available for some reason? Closed location?
        continue

df = pd.DataFrame(output)
df.to_csv('marstons_dump.csv', index=False)
I use Firefox, but it should also work with Chrome.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# driver = webdriver.Chrome(ChromeDriverManager().install())
driver = webdriver.Firefox()
driver.get("https://order.marstons.co.uk/")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="app"]/div/div/div/div[2]/div'))
    ).find_elements_by_tag_name('a')
    for el in element:
        print("heading", el.find_element_by_tag_name('h3').text)
        print("address", el.find_element_by_tag_name('p').text)
finally:
    driver.quit()

How to use concurrent.futures to web scrape faster with Selenium?

I'm trying to scrape a list of URLs with Selenium and concurrent.futures to speed up the process. I've found that I get a StaleElementReferenceException when using concurrent.futures, and the job titles do not correspond to the URLs; for instance, I get repeated job titles. When using a normal for loop I do not get this error.
I don't know what I'm doing wrong. Any help is welcome.
My simplified code is:
import concurrent.futures
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not escapes
wd = webdriver.Chrome(PATH, options=options)
wd.maximize_window()

vurl = ['https://www.bumeran.com.pe/empleos/asistente-contable-exp.-en-concar-ssp-1114585777.html',
        'https://www.bumeran.com.pe/empleos/asesor-a-comercial-digital-de-seguro-vehicular-1114584904.html',
        'https://www.bumeran.com.pe/empleos/mecanico-de-mantenimiento-arequipa-1114585709.html',
        'https://www.bumeran.com.pe/empleos/almacenero-l.o.-electronics-s.a.c.-1114585629.html',
        'https://www.bumeran.com.pe/empleos/analista-de-comunicaciones-ingles-avanzado-teleperformance-peru-s.a.c.-1114564863.html',
        'https://www.bumeran.com.pe/empleos/vendedores-adn-retail-s.a.c.-1114585422.html',
        'https://www.bumeran.com.pe/empleos/especialista-de-intervencion-de-proyectos-mondelez-international-1114585461.html',
        'https://www.bumeran.com.pe/empleos/desarrollador-java-senior-inetum-peru-1114584840.html',
        'https://www.bumeran.com.pe/empleos/practicante-legal-coes-sinac-1114584788.html',
        'https://www.bumeran.com.pe/empleos/concurso-publico-n-143-especialista-en-presupuesto-banco-central-de-reserva-del-peru-1114584538.html',
        'https://www.bumeran.com.pe/empleos/concurso-n-147-especialista-en-analisis-de-infraestructuras-financieras-banco-central-de-reserva-del-peru-1114584444.html',
        'https://www.bumeran.com.pe/empleos/asistente-legal-magdalena-del-mar-los-portales-1114584305.html',
        'https://www.bumeran.com.pe/empleos/asistente-de-nuevos-negocios-inmobiliarios-madrid-ingenieros-1114584269.html',
        'https://www.bumeran.com.pe/empleos/trabajo-desde-tres-horas-por-dia-ventas-ventas-por-internet-1114584205.html']

vtitle = []
vurl2 = []  # was missing from the snippet; get_urls appends to it

def get_urls(url):
    wd.get(url)
    wd.implicitly_wait(20)
    try:
        title = wd.find_element_by_xpath("//h1").text
        print('URL finished')
    except Exception:
        title = ''
        print('Exception!')
    vtitle.append(title)
    vurl2.append(url)

# This throws an exception and does not scrape correctly
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(get_urls, vurl)

# output is for example:
# ['ALMACENERO', 'ALMACENERO', 'ALMACENERO', 'ALMACENERO', 'Desarrollador Java (Senior)', 'Desarrollador Java (Senior)', 'Desarrollador Java (Senior)']
# when it should be:
# ['ALMACENERO', 'Analista de Comunicaciones - Inglés Avanzado', 'Vendedores', 'Especialista de Intervención de Proyectos', 'Desarrollador Java (Senior)', 'Practicante Legal', 'Asistente Legal - Magdalena del Mar']

# This works fine but is too slow
for url in vurl:
    get_urls(url)
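For reference, the usual culprit here is that all worker threads share the single wd driver, so navigations race each other and each thread reads the title of whichever page happens to be loaded. A minimal sketch of one driver per thread via threading.local (the names get_driver and fetch_title are illustrative, not from the original code):
import threading
import concurrent.futures
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # Each worker thread lazily creates, then reuses, its own Chrome.
    if not hasattr(thread_local, "driver"):
        options = webdriver.ChromeOptions()
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        thread_local.driver = webdriver.Chrome(options=options)
    return thread_local.driver

def fetch_title(url):
    driver = get_driver()
    driver.get(url)
    driver.implicitly_wait(20)
    try:
        title = driver.find_element_by_xpath("//h1").text
    except Exception:
        title = ''
    # Return (url, title) pairs so results stay matched to their URLs
    # instead of appending to shared lists from several threads.
    return url, title

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_title, vurl))
executor.map preserves input order, so the pairs in results line up with vurl; the per-thread drivers still need to be quit when you are done.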

Retrieve message from DOM with Selenium

I am learning Python and trying things out with Selenium. Today I am trying to retrieve a message from the Spectrum chat of Star Citizen: https://robertsspaceindustries.com/spectrum/community/SC/lobby/1
I would like to retrieve the div class="lobby-motd-message" because it contains useful information.
This is my code, but when I run it, it displays nothing... Can you help me solve this problem? Please. I will do more things with Selenium (a Discord bot), but I need to retrieve this information first.
#!/usr/bin/python3
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
opts = Options()
opts.headless = True
browser = webdriver.Firefox(options=opts)
url = "https://robertsspaceindustries.com/spectrum/community/SC/lobby/1"
browser.get(url)
browser.implicitly_wait(10)
try:
    info = browser.find_element_by_class_name("lobby-motd-message")
    print(info.text)
except Exception:
    print("not found")
browser.close()
quit()
Depending on what element you want, you might need to target a child element and get its text.
try:
    info = browser.find_element_by_class_name("lobby-motd-message")
    print(info.find_element_by_tag_name('p').text)
except Exception:
    print("not found")
Outputs
Star Citizen Alpha 3.11 is LIVE - Discover more here !
To get all of it:
print(info.get_attribute('innerHTML'))
Outputs
<p>Star Citizen Alpha 3.11 is LIVE - Discover more here !</p>

How to Handle Nested Loops in Selenium

I am hoping someone can help me handle a nested loop in Selenium. I am trying to scrape a website with Selenium, and I have to scrape multiple pieces of information behind different links.
So I got all the links and looped through each, but only the first link displayed the items I needed; then the code breaks.
def get_financial_info(self):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='/home/miracle/chromedriver')
    driver.get("https://www.financialjuice.com")
    try:
        WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trendWrap']")))
    except TimeoutException:
        driver.quit()
    category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
    for record in category_url:
        driver.get(record.get_attribute("href"))
        news = {}
        title_element = driver.find_elements_by_xpath("//p[@class='headline-title']")
        for news_record in title_element:
            news['title'] = news_record.text
            print(news)
Your category_url elements are valid only on the page where you located them; after the first navigation to another page they become stale.
You need to replace
category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
with
category_url = [a.get_attribute("href") for a in driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a")]
and then loop through the list of links as
for record in category_url:
    driver.get(record)
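For completeness, a sketch of how that fix slots into the loop from the question (everything here comes from the code above; only the inline print format is illustrative):
category_url = [a.get_attribute("href") for a in driver.find_elements_by_xpath(
    "//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a")]
for record in category_url:
    driver.get(record)
    # The headline elements are located fresh on every page, so no stale
    # references are carried across navigations.
    for news_record in driver.find_elements_by_xpath("//p[@class='headline-title']"):
        print({'title': news_record.text})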

How to get URLs from website

I tried to get all URLs from this website:
https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/
There are a lot of links like https://www.bbvavivienda.com/es/unidades/UV_n_UV00121705 inside, but I'm not able to recover them with Selenium. Any idea how to do it?
I'll add more info about how I tried it. Obviously I'm just starting with Python, Selenium, etc. Thanks in advance:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(r"D:\Python27\selenium\webdriver\chrome\chromedriver.exe")
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/")
urls = driver.find_element_by_css_selector('a').get_attribute('href')
print(urls)
links = driver.find_elements_by_partial_link_text('_self')
for link in links:
    print(link.get_attribute("href"))
driver.quit()
The following code should work. You are using the wrong identifier for the links.
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.bbvavivienda.com/es/buscador/venta/vivienda/todos/la-coruna/")
urls = driver.find_element_by_css_selector('a').get_attribute('href')
print(urls)
for link in driver.find_elements_by_xpath("//a[@target='_self']"):
    try:
        print(link.get_attribute("href"))
    except Exception:
        pass
driver.quit()
I don't know Python, but in Java we normally find all elements with the tag "a" to collect the links on a page. You may find the snippet below useful.
List<WebElement> links = driver.findElements(By.tagName("a"));
System.out.println(links.size());
for (int i = 0; i < links.size(); i++) {
    System.out.println(links.get(i).getText());
}
