Web Scraping Python - Pubs

I am trying to extract the site name and address data from this website for each card but this doesn't seem to work. Any suggestions?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://order.marstons.co.uk/")

all_cards = driver.find_elements_by_xpath("//div[@class='h3.body__heading']/div[1]")
for card in all_cards:
    print(card.text)  # do as you will

I'm glad that you are trying to help yourself; it seems you are new to this, so let me offer some help.
Automating a browser via Selenium to do this is going to take you forever. The Marston's site is pretty straightforward to scrape if you know where to look: if you open your browser's Developer Tools (F12 on PC), go to the Network tab, filter by Fetch/XHR and hit refresh while on the Marston's site, you'll see some backend API calls happening. If you click on the one that says "brand" and then open the "Preview" tab, you'll see a collapsible list of all sorts of information. That is a JSON response, essentially a collection of Python lists and dictionaries, which makes it easy to get the data you are after. The information in the "venues" list is going to be helpful when it comes to scraping the menus for each venue.
When you go to a specific pub you'll see an API call with the pub's name; this has all the menu info, which you can inspect in the same way, and we can call these venue APIs using the "slug" data from the venues response above.
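As a quick sanity check before the full script, something like this (a sketch; the endpoint, header and key names are the same ones used in the code below, though the exact fields may vary) should print the venue count and show that a venue response contains the menus:
import requests

headers = {'origin': 'https://order.marstons.co.uk'}
brand = requests.get('https://api-cdn.orderbee.co.uk/brand', headers=headers).json()
print(len(brand['venues']), 'venues')  # the "venues" list described above
slug = brand['venues'][0]['slug']      # the slug identifies each venue's own endpoint
venue = requests.get(f'https://api-cdn.orderbee.co.uk/venues/{slug}', headers=headers).json()
print(list(venue.keys()))              # should include 'menus'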
So by making our own requests to these URLs and stepping through the JSON to get the data we want, we can have everything done in a couple of minutes, far easier than trying to do this by automating a browser! I've written the code below; feel free to ask questions if anything is unclear. You'll need to pip install requests and pandas to make this work. You owe me a pint! :) Cheers
import requests
import pandas as pd

headers = {'origin': 'https://order.marstons.co.uk'}

url = 'https://api-cdn.orderbee.co.uk/brand'
resp = requests.get(url, headers=headers).json()

venues = {}
for venue in resp['venues']:
    venues[venue['slug']] = venue

print(f'{len(venues)} venues to scrape')

output = []
for venue in venues.keys():
    try:
        url = f'https://api-cdn.orderbee.co.uk/venues/{venue}'
        print(f'Scraping: {venues[venue]["name"]}')
        try:
            info = requests.get(url, headers=headers).json()
        except Exception as e:
            print(e)
            print(f'{venues[venue]["name"]} not available')
            continue
        for category in info['menus']['oat']['categories']:  # oat = order at table?
            cat_name = category['name']
            for subcat in category['subCategories']:
                subcat_name = subcat['name']
                for item in subcat['items']:
                    row = {
                        'venue_name': venues[venue]['name'],
                        'venue_city': venues[venue]['address']['city'],
                        'venue_address': venues[venue]['address']['streetAddress'],
                        'venue_postcode': venues[venue]['address']['postCode'],
                        'venue_latlng': venues[venue]['address']['location']['coordinates'],
                        'category': cat_name,
                        'subcat': subcat_name,
                        'item_name': item['name'],
                        'item_price': item['price'],
                        'item_id': item['id'],
                        'item_sku': item['sku'],
                        'item_in_stock': item['inStock'],
                        'item_active': item['isActive'],
                        'item_last_update': item['updatedAt'],
                        'item_diet': item['diet']
                    }
                    output.append(row)
    except Exception as e:
        print(f'Problem scraping {venues[venue]["name"]}, skipping it')  # when there is no menu available for some reason? Closed location?
        continue

df = pd.DataFrame(output)
df.to_csv('marstons_dump.csv', index=False)

I use Firefox, but it should also work with Chrome.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# driver = webdriver.Chrome(ChromeDriverManager().install())
driver = webdriver.Firefox()
driver.get("https://order.marstons.co.uk/")

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="app"]/div/div/div/div[2]/div'))
    ).find_elements_by_tag_name('a')
    for el in element:
        print("heading", el.find_element_by_tag_name('h3').text)
        print("address", el.find_element_by_tag_name('p').text)
finally:
    driver.quit()

Related

How to scrape news articles from cnbc with keyword "Green hydrogen"?

I am trying to scrape the news articles listed at this URL; all the articles are in span.Card-title. But this gives blank output. Is there any way to resolve this?
from bs4 import BeautifulSoup as soup
import requests

cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content, 'html.parser')

day = bsobj.find(id="root")
print(day.find_all('span', class_='Card-title'))
for link in bsobj.find_all('span', class_='Card-title'):
    print('Headlines : {}'.format(link.text))
The problem is that the content is not present on the page when it first loads; only afterwards is it fetched from the server using a url like this
https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28
and added to the page.
Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
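For example, a minimal sketch that requests that endpoint directly (the query string is copied from the URL above; the 'results' list and 'cn:title' field are what the JSON appears to contain):
import requests

api_url = ("https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
           "&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false"
           "&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C"
           "&facetedvalue=!Press%20Release%7C&needtoptickers=1"
           "&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28")
resp = requests.get(api_url)
for article in resp.json().get('results', []):
    print(article.get('cn:title'))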
As mentioned in another answer, the data about the articles are loaded using another link, which you can find via the networks tab in devtools. [In chrome, you can open devtools with Ctrl+Shift+I, then go to the networks tab to see requests made, and then click on the name starting with 'json.aspx?...' to see details, then copy the Request URL from Headers section.]
Once you have the Request URL, you can copy it and make the request in your code to get the data:
# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']: print(r['cn:title'])
If you don't feel like trying to find that one request in 250+ other requests, you might also try to assemble a shorter form of the url with something like:
# import urllib.parse

# find link to js file with api key
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]
jRes = requests.get(jUrl)  # request js file api key

# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()

# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for,  # = 'green hydrogen'
    'batchsize': 10  # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
Even though the assembled dataReqUrl is not identical to the copied one, it seems to be giving the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with selenium:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# define chromeDriver_path <-- where you saved 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
driver = webdriver.Chrome(chromeDriver_path)
driver.get(cnbc_url)
ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)
cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles]
for c in cardTitles_text: print(c)
In my opinion, this approach is more reliable as well as simpler.

How to skip an element (continue looping) or fill it with a certain value if the element doesn't exist? Selenium Python

I'm sorry for my terrible English. I'm kinda new to Python. I would like to know how to skip a for-loop iteration if a web element does not exist, or how to fill it with another value. I've been trying to scrape YouTube channels to get the title, views, and when each video was posted. My code looks like this:
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome import service
from selenium.webdriver.common.keys import Keys
import time
import wget
import os
import pandas as pd
import matplotlib.pyplot as plt

urls = [
    'https://www.youtube.com/c/LofiGirl/videos',
    'https://www.youtube.com/c/Miawaug/videos'
]

for url in urls:
    PATH = 'C:\webdrivers\chromedriver.exe.'
    driver = webdriver.Chrome(PATH)
    driver.get(url)
    #driver.maximize_window()
    driver.implicitly_wait(10)

    for i in range(10):
        driver.find_element(By.TAG_NAME, "Body").send_keys(Keys.END)
        driver.implicitly_wait(20)
        time.sleep(5)

    judul_video = []
    viewers = []
    tanggal_posting = []

    titles = driver.find_elements(By.XPATH, "//a[@id='video-title']")
    views = driver.find_elements(By.XPATH, "//div[@id='metadata-line']/span[1]")
    DatePosted = driver.find_elements(By.XPATH, "//div[@id='metadata-line']/span[2]")

    for title in titles:
        judul_video.append(title.text)
        driver.implicitly_wait(5)

    for view in views:
        viewers.append(view.text)
        driver.implicitly_wait(5)

    for posted in DatePosted:
        tanggal_posting.append(posted.text)
        driver.implicitly_wait(5)

    vid_item = {
        "video_title": judul_video,
        "views": viewers,
        "date_posted": tanggal_posting
    }

    df = pd.DataFrame(vid_item, columns=["video_title", "views", "date_posted"])
    #df_new = df.transpose()
    print(df)

    filename = url.split('/')[-2]
    df.to_csv(rf"C:\Users\.......\YouTube_{filename}.csv", sep=",")
    driver.quit()
That code works well, but at this part:
for posted in DatePosted:
    tanggal_posting.append(posted.text)
    driver.implicitly_wait(5)
when a channel is live streaming, such as Lofi Girl, I get an error saying "All arrays must be of the same length". Apparently I failed to create an if/else condition to fill the streaming entry with another value, such as tanggal_posting.append("Live Stream"), or to skip the extraction for it entirely, starting from the title. The code below tries to skip or fill in another value, but fails:
for posted in DatePosted:
    if len(posted.text) > 0:
        tanggal_posting.append(posted.text)
        driver.implicitly_wait(5)
    else:
        tanggal_posting.append("Live")
        driver.implicitly_wait(5)
How can I skip the iteration just for a single video that is shown as a live stream? Or how can I fill in another value such as "Live Stream" using an if/else condition as mentioned before? Thank you so much in advance.
Personally, I'd check first if posted is viable for a .text attribute call.
for posted in DatePosted:
    _posted = posted.text.strip() if posted else None
    tanggal_posting.append(_posted if _posted else "Live")
    driver.implicitly_wait(5)
Alternatively:
for posted in DatePosted:
    _posted = posted.text.strip() if posted else None
    if not _posted:
        continue
    tanggal_posting.append(_posted)
    driver.implicitly_wait(5)
The overall code should differ depending on your objective. Though I suppose _posted will be helpful in any of them.
Instead of collecting 3 separate lists for each data item, I'd suggest getting the list of videos and then extracting and handling each item:
videos = driver.find_elements(By.XPATH, "//div[@id='items']/ytd-grid-video-renderer")
for video in videos:
    if not video.find_elements(By.XPATH, ".//yt-icon"):  # Check if no Streaming icon
        title = video.find_element(By.XPATH, ".//a[@id='video-title']")
        view = video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[1]")
        DatePosted = video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[2]")
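For example, a rough sketch of feeding that into your existing DataFrame logic (the rows list is just for illustration; the XPaths are the same as above):
rows = []
for video in videos:
    if video.find_elements(By.XPATH, ".//yt-icon"):  # streaming icon present -> skip it
        continue  # (or append a row of "Live Stream" placeholders here instead)
    rows.append({
        "video_title": video.find_element(By.XPATH, ".//a[@id='video-title']").text,
        "views": video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[1]").text,
        "date_posted": video.find_element(By.XPATH, ".//div[@id='metadata-line']/span[2]").text,
    })

df = pd.DataFrame(rows, columns=["video_title", "views", "date_posted"])
Since every video contributes one complete dict, the lists always stay the same length and the "All arrays must be of the same length" error goes away.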
Note that you need to call driver.implicitly_wait(<SECONDS>) only ONCE, at the beginning of the script!

How to download PDF from url in python

Note: This is a very different problem compared to other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) available for similar questions.
This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not directly return the pdf; it in turn makes several other calls, and one of them is the url that returns the pdf file.
I want to be able to call the url with a variable for the query param TRACKNO and to be able to save the pdf file using python.
I was able to do this using selenium, but my code fails to work when the browser is used in headless mode and I need it to work in headless mode. The code that I wrote is as follows:
import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
            link_split = request['name'].split('-')
            if(link_split[-1] == 'filedownload=X'):
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url, tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url, verify=False)
    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
    # options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    for num in tracking_nums:
        print(num, end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url, num)
        except:
            pass
    driver.quit()
get_payment_advice(['8262'])
As can be seen, I get all the network calls that the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.
Is there any other way of doing this as this seems like a workaround? If not, can this be fixed to work in headless mode?
I fixed it; I only changed one function. The correct url is in the page_source of the driver (with BeautifulSoup you can parse html, xml, etc.):
from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    return f"https://webice.ongc.co.in{data}"
The hostname part can probably be extracted from the driver as well.
I don't think I changed anything else, but if it doesn't work for you, I can paste the full code.
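For example, a small variation of the same function (just a sketch; it assumes the object's data attribute is a relative or absolute URL) could build the link from the page the driver is on instead of hard-coding the host:
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    return urljoin(driver.current_url, data)  # resolves the path against the current webice page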
Old Answer:
If you print the text of the returned page (print(driver.page_source)), I think you will get a message that says something like:
"Because of your system configuration the pdf can't be loaded"
This is because the requested site checks some preferences to decide whether you are a robot or not. Maybe it helps to change some arguments (screen size, user agent) to fix this; there is plenty of information around about how headless browsers get detected.
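A sketch of the kind of arguments that are commonly tried (the user-agent value here is only a placeholder, not a real string):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('--window-size=1920,1080')      # give the headless browser a normal screen size
options.add_argument('--user-agent=Mozilla/5.0 ...')  # placeholder: substitute a real desktop user agent
driver = webdriver.Chrome(options=options)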
And next time, please paste all relevant code (including imports) into the question to make it easier to test.

How do I make the driver navigate to new page in selenium python

I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to the lack of some POST information from the previous page)
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source

stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()

jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)

keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)

# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))

print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to search for elements on the page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if I try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these two links is that one of them takes only one value for experience level while the other one takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking on the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; it does nothing until you call .until() (or .until_not()) on it with an expected condition.
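A sketch of the fix: wait for an expected condition after the click before reading the URL (EC.url_contains is a standard expected condition; '/jobs/search' is just the fragment you expect in the new URL):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10).until(EC.url_contains("/jobs/search"))  # actually blocks until the URL changes
print(driver.current_url)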

Scraping dynamic DataTable of many pages but same URL

I have experience with C and I'm starting to approach Python, mostly for fun.
I am trying to scrape this page here https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel.
Since the table, with the content I'm interested on, is dynamically created after connecting to the page, I'm using:
Selenium to load the page in the browser
Beautiful soup 4 for scraping the data loaded
At the moment I'm able to scrape all the fields of interest for the first 25 entries, the ones that are loaded once connected to the page. I can show up to 100 entries on one page, but there are 1045 entries in total, split across different pages. The problem is that the url is the same for all the pages and the content of the table is dynamically loaded at runtime.
What I would like to do is find a way to scrape all 1045 entries. Reading around the internet, I have understood that I should send a proper POST request from my code (I've also found that they are retrieving data from https://www.finanztreff.de/), get the data from the response and scrape it.
I can see two possibilities :
Retrieve all the entries at once
Retrieve one page after the other and scrape one after the other
I have no idea how to build up the POST request.
I think there is no need to post the code but if needed I can re-edit the question.
Thanks in advance to everybody.
EDITED
Here you go with some code
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import requests

firefox_binary = FirefoxBinary('some path\\firefox.exe')
browser = webdriver.Firefox(firefox_binary=firefox_binary)
url = "https://www.justetf.com/it/find-etf.html"
browser.get(url)

delay = 5  # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'Alerian')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

page_source = browser.page_source
soup = BeautifulSoup(page_source, 'lxml')
From here on I just play a bit with the bs4 APIs.
This should do the trick (getting all the data at once):
import requests as r

link = 'https://www.justetf.com/it/find-etf.html?groupField=index&from=search&/it/find-etf.html%3F1-1.0-esearch-etfsPanel'
link2 = 'https://www.justetf.com/servlet/etfs-table'

data = {
    'draw': 1,
    'start': 0,
    'length': 10000000,
    'lang': 'it',
    'country': 'DE',
    'universeType': 'private',
    'etfsParams': link.split('?')[1]
}

res = r.post(link2, data=data)
result = res.json()
print(len(result["data"]))
EDIT: For the explanation: I opened the Network tab in Chrome and clicked through the next pages to see what requests were being made, and I noticed that a POST request was made to link2 with a lot of parameters, most of which were mandatory.
For the needed parameters: for draw I only needed one draw (one request); start starts from position 0; and for length I used a big number to scrape everything at once. If length were, say, 10, you'd need a lot of draws; they would go draw=2&start=10&length=10, draw=3&start=20&length=10 and so on. For lang, country and universeType I don't know the exact use, but removing them got the request rejected. And lastly, etfsParams is what comes after the '?' in link.
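If you'd rather page through instead of pulling everything in one request, the draw/start/length loop might look roughly like this (a sketch reusing link2 and data from above; page_size and the empty-page stop condition are assumptions):
page_size = 100
all_rows = []
draw, start = 1, 0
while True:
    data.update({'draw': draw, 'start': start, 'length': page_size})
    page = r.post(link2, data=data).json()
    rows = page['data']
    if not rows:  # no more rows -> stop
        break
    all_rows.extend(rows)
    draw += 1
    start += page_size

print(len(all_rows))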
