I've created a script in Python with Selenium to scrape the website address located within the Contact details section of a site. The problem, however, is that there is no URL associated with that link (though I can click on it).
How can I parse the website link located within Contact details?
from selenium import webdriver

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver, link):
    driver.get(link)
    website = driver.find_element_by_css_selector("[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver, URL)
    finally:
        driver.quit()
When I run the script, I get the visible text associated with that link, which is Visit website.
The element with the "Visit website" text is a span that has the vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') JavaScript handler attached, not an actual href.
My suggestion: if your goal is scraping rather than testing, you can use the solution below with the requests package to get the data as JSON and extract any information you need.
Another option is to actually click the element, as you did.
import requests
import re

headers = {
    'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/73.0.3683.75 Safari/537.36',
    'DNT': '1',
}

response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
                        headers=headers)
assert response.ok

# extract token from response text
token = re.search("token:\\s'(.*)'", response.text)[1]

headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'

response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?&passToken={token}', headers=headers)
assert response.ok

# use response.text to get the full json as text and see what information can be extracted
contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
website = list(filter(lambda x: x["type"] == "website", contact))[0]["value"]
print(website)
print("the end")
The Situation
I am trying to scrape webpages to get some data. For my application, I need the HTML data that is viewable in the browser as a whole.
The Problem
But when I scrape some URLs, I get data that is not viewable in the browser, even though it is there in the HTML code. So is there any way to scrape only the data that is viewable in the browser?
Code
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/nebu/selenium_drivers/chromedriver")

URL = "https://augustasymphony.com/event/top-of-the-world/"
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
finally:
    # quit in a finally block so the browser closes even on a WebDriverException
    driver.quit()

soup = BeautifulSoup(html_content, "html.parser")
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is None:
        continue
    else:
        s.extract()
text = soup.getText(separator=u' ')
print(text)
The Question
Where am I going wrong here?
How can I go about debugging this?
This is simply a case of you needing to extract the data in a more specific manner.
You have 2 options really:
Option 1: (In my opinion the better one, as it is faster and less resource-heavy.)
import requests
from bs4 import BeautifulSoup as bs

headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
soup = bs(res.text, "lxml")

event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
time = soup.find("p", {"class": "rhino-event-time"}).text.strip()
You can use requests quite simply to find the data, as shown in the code above, specifically selecting the data you want and perhaps saving it in a dictionary. This is the normal way to go about it. The page may contain a lot of scripts, but it doesn't require JavaScript to load said data dynamically.
Option 2:
You can continue using Selenium and collect the entire body information of the page using one of multiple selections:
driver.find_element_by_id('wrapper').get_attribute('innerHTML') # Entire body
driver.find_element_by_id('tribe-events').get_attribute('innerHTML') # the events list
driver.find_element_by_id('rhino-event-single-content').get_attribute('innerHTML') # the single event
This second option is more a matter of taking the whole HTML and dumping it.
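If you take this route, the dumped innerHTML can still be fed back into BeautifulSoup and reduced to a dictionary. A minimal sketch, where the hard-coded snippet stands in for the real get_attribute('innerHTML') return value (the markup below is an assumption for illustration, not the site's actual HTML):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.find_element_by_id('rhino-event-single-content').get_attribute('innerHTML')
inner_html = ('<h2 class="rhino-event-header">Top of the World</h2>'
              '<p class="rhino-event-time">Saturday, 7:30 PM</p>')

soup = BeautifulSoup(inner_html, "html.parser")
event = {
    "event_header": soup.find("h2", {"class": "rhino-event-header"}).text.strip(),
    "time": soup.find("p", {"class": "rhino-event-time"}).text.strip(),
}
print(event)
```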
Personally I would go with the first option creating dictionaries of the cleaned data.
Edit:
To further illustrate my example:
import requests
from bs4 import BeautifulSoup as bs

headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

res = requests.get("https://augustasymphony.com/event/", headers=headers)
soup = bs(res.text, "lxml")

seedlist = {a["href"] for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a") if '?ical=1' not in a["href"]}

for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    data = dict()
    data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
    print(data)
Here I am generating a seedlist of event URLs and then going into each one to find the information.
It's because some websites detect whether the request comes from an actual web browser. If it doesn't, they don't send the HTML file back, which is why you see no HTML in the response.
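A common workaround is to send a browser-like User-Agent header so the site serves the same HTML a browser would see. A minimal sketch (the header string is just an example, and fetch is a hypothetical helper that is not called here):

```python
import requests

# Browser-like headers make the request look like a normal browser visit
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
}

def fetch(url):
    # Pass the headers with every request
    return requests.get(url, headers=HEADERS)
```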
I'm trying to scrape different agency names from the second page of a webpage using the requests module. I can parse the names from its landing page by sending a GET request to the very URL. However, when it comes to accessing the names from its second page and later, I need to send POST HTTP requests along with the appropriate parameters. I tried to mimic the POST request exactly the way I see it in dev tools, but all I get in return is the following:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1"><redirect url="/ptn/exceptionhandler/sessionExpired.xhtml"></redirect></partial-response>
This is how I've tried:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

link = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
url = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {
        'contentForm': 'contentForm',
        'contentForm:j_idt171_windowName': '',
        'contentForm:j_idt187_listButton2_HIDDEN-INPUT': '',
        'contentForm:j_idt192_searchBar_INPUT-SEARCH': '',
        'contentForm:j_idt192_searchBarList_HIDDEN-SUBMITTED-VALUE': '',
        'contentForm:j_id135_0': 'Title',
        'contentForm:j_id135_1': 'Document No.',
        'contentForm:j_id136': 'Match All',
        'contentForm:j_idt853_select': 'ON',
        'contentForm:j_idt859_select': '0',
        'javax.faces.ViewState': soup.select_one('input[name="javax.faces.ViewState"]')['value'],
        'javax.faces.source': 'contentForm:j_idt902:j_idt955_2_2',
        'javax.faces.partial.event': 'click',
        'javax.faces.partial.execute': 'contentForm:j_idt902:j_idt955_2_2 contentForm:j_idt902',
        'javax.faces.partial.render': 'contentForm:j_idt902:j_idt955 contentForm dialogForm',
        'javax.faces.behavior.event': 'action',
        'javax.faces.partial.ajax': 'true'
    }

    s.headers['Referer'] = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
    s.headers['Faces-Request'] = 'partial/ajax'
    s.headers['Origin'] = 'https://www.gebiz.gov.sg'
    s.headers['Host'] = 'www.gebiz.gov.sg'
    s.headers['Accept-Encoding'] = 'gzip, deflate, br'

    res = s.post(url, data=payload, allow_redirects=False)
    # soup = BeautifulSoup(res.text, "lxml")
    # for item in soup.select(".commandLink_TITLE-BLUE"):
    #     print(item.get_text(strip=True))
    print(res.text)
How can I parse names from the second page of a webpage when the URL remains unchanged?
You can use Selenium to traverse between pages. The following code will allow you to do this.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36")

driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
driver.get("https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu")

# check if the next page exists
next_page = driver.find_element_by_xpath("//input[starts-with(@value, 'Next')]")

# click the next button
while next_page is not None:
    time.sleep(5)
    click_btn = driver.find_element_by_xpath("//input[starts-with(@value, 'Next')]")
    click_btn.click()
    time.sleep(5)
    next_page = driver.find_element_by_xpath("//input[starts-with(@value, 'Next')]")
I have not added the code for extracting the Agency names. I presume it will not be difficult for you.
Make sure to install Selenium and download the Chrome driver, and make sure to download the correct version of the driver. You can confirm your browser's version by viewing the 'About' section of Chrome.
I'm trying to scrape the product links under ZIP code 08041. I have written the code to scrape the products without a ZIP code, but I don't know how to scrape and send the request for the products under 08041.
Here is my code:
import requests
import random
import time
from bs4 import BeautifulSoup
import wget
import csv
from fp.fp import FreeProxy

def helloworld(url):
    r = requests.get(url)
    print('Status', r.status_code)
    #time.sleep(8)
    soup = BeautifulSoup(r.content, 'html.parser')
    post = soup.find_all('a', "name")
    for href in post:
        if (href.get('href')[1] == 'p'):
            href = href.get('href')
            print(href)

def page_counter():
    url1 = "https://soysuper.com/c/aperitivos#products"
    print(url1, '\n')
    helloworld(url1)

page_counter()
You can use the back-end endpoints to mimic a request with a given ZIP code.
Note: The cookie is hard-coded but valid for a year.
Here's how:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Cookie": "soysuper=eyJjYXJ0IjoiNjA2NWNkMzg5ZDI5YzkwNDU1NjI3MzYzIiwiZXhwaXJlcyI6MTY0ODg0MTMzOSwib3JpZCI6IkM2NzgwOUYyLTkyRUYtMTFFQi04NjNELTgzMTBCMUUwMTM2NiIsInNtIjoiIiwidXVpZCI6IkIwQjYxQzRFLTkyRUYtMTFFQi05MjRCLTA5MTFCMUUwMTM2NiIsIndoIjpbIjU0MDQ5MjEwMDk1Y2ZhNTQ2YzAwMDAwMCIsIjRmZjMwZTZhNTgzMmU0OGIwMjAwMDAwMCIsIjU5Y2JhZmE2OWRkNGU0M2JmMzIwODM0MiIsIjRmMzEyNzU4ZTNjNmIzMDAzMjAwMDAwMCIsIjVhMTZmNjdhMjUwOGMxNGFiMzE0OTY4MyIsIjYwMjQxNTEzNzIyZDZhNTZkNDZlMjhmNyIsIjRmZjMwZTJkYzI3ZTk1NTkwMjAwMDAwMSIsIjU5ZjcxYTZlNjI4YWIwN2UyYjJjZmJhMSIsIjU5Y2JhZjNjOWRkNGU0M2JmMzIwODM0MSIsIjVhMGU0NDFhNTNjOTdiM2UxNDYyOGEzNiIsIjRmMmJiZmI3ZWJjYjU1OGM3YjAwMDAwMCIsIjYwNDExZjJlNzIyZDZhMTEyZDVjYTNlYiIsIjViMWZmZjAyNzI1YTYxNzBjOTIxMjc0MSIsIjVlNzk2NWUwZDc5MTg3MGU0NTA1MGMwMCIsIjVkMTI0NDQ2OWRkNGU0NGFkMDU3MmMxMSJdLCJ6aXAiOiIwODA0MSJ9--166849121eece159a6fdb0c0fe8341032321d9b1;"
}

with requests.Session() as connection:
    r = connection.get("https://soysuper.com/supermarket?zipcode=08041", headers=headers)
    headers["Request-Id"] = r.headers["Next-Request-Id"]
    headers["Referer"] = "https://soysuper.com/c/aperitivos"
    products_data = connection.get("https://soysuper.com/c/aperitivos?products=1&page=1", headers=headers).json()
    print(products_data["products"]["total"])
Output: Total number of products for 08041 zip code.
2923
What you're effectively getting is a JSON with all the product data for a given page. This is what it looks like in the Network tab.
Do notice the pager key. Use it to "paginate" the API and get more product info.
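Assuming the endpoint keeps the ?products=1&page=N shape used above (an assumption based on the pager key, not a documented API), paginating is just a matter of incrementing the page parameter. fetch_pages below is a hypothetical helper and is not called:

```python
def page_url(page):
    # Build the per-page endpoint URL; page numbering starts at 1
    return f"https://soysuper.com/c/aperitivos?products=1&page={page}"

def fetch_pages(connection, headers, last_page):
    # Would walk the pager and collect the JSON for every page
    return [connection.get(page_url(p), headers=headers).json()
            for p in range(1, last_page + 1)]

print(page_url(2))  # https://soysuper.com/c/aperitivos?products=1&page=2
```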
The Situation:
The "I'm Feeling Lucky!" project in the "Automate the Boring Stuff with Python" ebook no longer works with the code the author provided.
Specifically:
linkElems = soup.select('.r a')
What I have done:
I've already tried using the solution provided within this stackoverflow question
I'm also currently using the same search format.
Code:
import webbrowser, requests, bs4

def im_feeling_lucky():
    # Make search query look like Google's
    search = '+'.join(input('Search Google: ').split(" "))

    # Pull html from Google
    print('Googling...')  # display text while downloading the Google page
    res = requests.get(f'https://google.com/search?q={search}&oq={search}')
    res.raise_for_status()

    # Retrieve top search result link
    soup = bs4.BeautifulSoup(res.text, features='lxml')

    # Open a browser tab for each result.
    linkElems = soup.select('.r')  # Returns empty list
    numOpen = min(5, len(linkElems))
    print('Before for loop')
    for i in range(numOpen):
        webbrowser.open(f'http://google.com{linkElems[i].get("href")}')
The Problem:
The linkElems variable returns an empty list [] and the program doesn't do anything past that.
The Question:
Could somebody please guide me to the correct way of handling this and perhaps explain why it isn't working?
I too had the same problem while reading that book and found a solution for it.
replacing
soup.select('.r a')
with
soup.select('div#main > div > div > div > a')
will solve that issue
Following is the code that will work:
import webbrowser, requests, bs4, sys

print('Googling...')
res = requests.get('https://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('div#main > div > div > div > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get("href"))
The above code takes input from command-line arguments.
I took a different route. I saved the HTML from the request, opened that page, and inspected the elements. It turns out that the page is different when opened natively in the Chrome browser compared to what my Python request is served. I identified the div class that appears to denote a result and substituted it for the .r - in my case it was .kCrYT.
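Saving the served HTML for inspection is a one-liner; here html_text stands in for res.text from the actual request:

```python
from pathlib import Path

html_text = "<html><body><div class='kCrYT'>...</div></body></html>"  # in practice: res.text
Path("served_page.html").write_text(html_text, encoding="utf-8")

# Open served_page.html in a browser to see exactly what your script was served
print(Path("served_page.html").exists())  # True
```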
#! python3
# lucky.py - Opens several Google Search results.

import requests, sys, webbrowser, bs4

print('Googling...')  # display text while the google page is downloading
url = 'http://www.google.com.au/search?q=' + ' '.join(sys.argv[1:])
url = url.replace(' ', '+')

res = requests.get(url)
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# get all of the 'a' tags after an element with the class 'kCrYT' (which are the results)
linkElems = soup.select('.kCrYT > a')

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open_new_tab('http://google.com.au' + linkElems[i].get('href'))
Different websites (for instance Google) generate different HTML code for different User-Agents (this is how the web browser is identified by the website). Another solution to your problem is to use a browser User-Agent to ensure that the HTML you obtain from the website is the same you would get by using "view page source" in your browser. The following code just prints the list of Google search result URLs, which is not the same as in the book you referenced, but it's still useful to demonstrate the point.
#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

print('Please enter your search term:')
searchTerm = input()
print('Googling...')  # display text while downloading the Google page
url = 'http://google.com/search?q=' + '+'.join(searchTerm.split())
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
res = requests.get(url, headers=headers)
res.raise_for_status()

# Retrieve top search results links.
soup = bs4.BeautifulSoup(res.content, 'html.parser')

# Open a browser tab for each result.
linkElems = soup.select('.r > a')  # Used '.r > a' instead of '.r a' because
numOpen = min(5, len(linkElems))   # there are many href after div class="r"
for i in range(numOpen):
    # webbrowser.open('http://google.com' + linkElems[i].get('href'))
    print(linkElems[i].get('href'))
There's actually no need to save the HTML file. One of the reasons the response output differs from what you see in the browser is that no headers are sent with the request, in this case a user-agent, which acts as a "real" user visit (as Cucurucho already wrote).
When no user-agent is specified (when using the requests library), it defaults to python-requests. Google recognizes this, blocks the request, and you receive a different HTML with different CSS selectors. Check what your user-agent is.
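You can inspect the default user-agent locally, without sending any request:

```python
import requests

# The User-Agent string requests sends when none is specified
print(requests.utils.default_user_agent())  # e.g. python-requests/2.28.1 (version varies)
```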
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
To easier grab CSS selectors, have a look at the SelectorGadget extension to get CSS selectors by clicking on the desired element in your browser.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'how to create minecraft server',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# [:5] - first 5 results
# container with needed data: title, link, snippet, etc.
for result in soup.select('.tF2Cxc')[:5]:
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')

----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time figuring out how to bypass blocks from Google or which CSS selector is right for parsing the data; instead, you just pass the parameters (params) you want and iterate over the structured JSON to get the data you need.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result["link"], sep="\n")

----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Disclaimer: I work for SerpApi.
I'm writing a Python program to extract and store metadata from interesting online tech articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".
This is the code I'm using...
# Setup Headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"

# Create the Request
http = urllib3.PoolManager()

# Create the Response
response = http.request('GET', url, headers=headers)

# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')

# Scrape <meta property="og:title" content=" x x x ">
title = ''
for tag in soup.find_all("meta"):
    if tag.get("property", None) == "og:title":
        if len(tag.get("content", None)) > len(title):
            title = tag.get("content", None)
The program runs fine on all but one site. On "forbes.com", I can't get to the articles using Python:
url=
https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
I can't bypass this consent page, which seems to be the "Cookie Consent Manager" solution from "TrustArc". On a computer, you basically provide your consent once, and on each subsequent run you're able to access the articles.
If I reference the "toURL" url:
https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
And bypass the "https://www.forbes.com/consent/" page, I'm redirected back to this page.
I've tried to see if there is a cookie I could set in the header, but couldn't find the magic key.
Can anyone help me?
There is a required cookie, notice_gdpr_prefs, that needs to be sent to view the data:
import requests
from bs4 import BeautifulSoup

src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })
soup = BeautifulSoup(src.content, 'html.parser')

title = soup.find("meta", property="og:title")
print(title["content"])