scraping comments from booking.com - python

I'm trying to get all the reviews from a specific hotel page on booking.com.
I tried the code below, but nothing gets printed at all:
import urllib.request
from bs4 import BeautifulSoup
url='https://www.booking.com/hotel/sa/sarwat-park.ar.html?aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaMQBiAEBmAERuAEHyAEM2AED6AEBiAIBqAIDuAL_oY-aBsACAdICJDE5YzYxY2ZiLWRlYjUtNDRjNC04Njk0LTlhYWY4MDkzYzNhNNgCBOACAQ&sid=c7009aac67195c0a7ef9aa63f6537581&dest_id=6376991;dest_type=hotel;dist=0;group_adults=2;group_children=0;hapos=1;hpos=1;no_rooms=1;req_adults=2;req_children=0;room1=A%2CA;sb_price_type=total;sr_order=popularity;srepoch=1665388865;srpvid=1219386046550156;type=total;ucfs=1&#tab-reviews'
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)

To begin with, no element on the page has the class "review_item".
A better approach would be to use lxml's etree and select the reviews list by its XPath:
//*[@id="b2hotelPage"]/div[25]/div/div/div/div[1]/div[2]/div/ul
Then you could do something like:
import requests as req
from bs4 import BeautifulSoup as bs
from lxml import etree
# URL and headers are the page address and the User-Agent dict from the question
webpage = req.get(URL, headers=headers)
soup = bs(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
listTarget = dom.xpath('//*[@id="b2hotelPage"]/div[25]/div/div/div/div[1]/div[2]/div/ul')
This should give you a list of lxml elements containing your comment cards.
You can then process each of them in a similar fashion.
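A positional path such as div[25] tends to break as soon as the page layout shifts. The sketch below shows the same lxml technique with a path anchored on the id and a class instead. The markup, the review_list and review_text class names, and the review texts are all made up for illustration; the real page will differ.

```python
from lxml import etree

# Hypothetical markup standing in for a reviews list; the real page differs.
SAMPLE_HTML = """
<html><body>
  <div id="b2hotelPage">
    <ul class="review_list">
      <li><p class="review_text">Clean rooms</p></li>
      <li><p class="review_text">Friendly staff</p></li>
    </ul>
  </div>
</body></html>
"""

dom = etree.HTML(SAMPLE_HTML)

# Anchor on the id, then match by class instead of counting div positions.
cards = dom.xpath('//*[@id="b2hotelPage"]//ul[@class="review_list"]/li')

# string(...) returns the text content of the first matching node.
review_texts = [card.xpath("string(.//p)").strip() for card in cards]
print(review_texts)
```

Each item in cards is an lxml element, so further relative XPath queries can be run on it.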

Related

Parsing text with bs4 works with selenium but does not work with requests in Python

This code works and returns the single-digit number that I want, but it is slow and takes a good 10 seconds to complete. I will be running it 4 times, so that's 40 seconds wasted on every run.
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get('https://warframe.market/items/ivara_prime_blueprint')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
price_element = soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))
driver.close()
This code, on the other hand, does not work. It returns None.
import requests
from bs4 import BeautifulSoup
url='https://warframe.market/items/ivara_prime_blueprint'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
price_element=soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))
My first thought was to add a User-Agent, but that still did not work. When I print(soup) I get HTML, but when I parse it further the lookups return None, even though they are the same commands as in the Selenium example.
The data is loaded dynamically: it is embedded as JSON inside a <script> tag, so the div elements you are searching for are not in the HTML that requests receives (BeautifulSoup doesn't render JavaScript).
As an example, to get the data, you can use:
import json
import requests
from bs4 import BeautifulSoup
url = "https://warframe.market/items/ivara_prime_blueprint"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.select_one("#application-state")
json_data = json.loads(script_tag.string)
# Uncomment the line below to see all the data
# from pprint import pprint
# pprint(json_data)
for data in json_data["payload"]["orders"]:
    print(data["user"]["ingame_name"])
Prints:
Rogue_Monarch
Rappei
KentKoes
Tenno61189
spinifer14
Andyfr0nt
hollowberzinho
You can access the data as a dict and access its keys/values.
I'd recommend an online tool to view all the JSON since it's quite large.
See also
Parsing out specific values from JSON object in BeautifulSoup
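Since the question asked for a price rather than user names, here is a minimal offline sketch of the same technique applied to a trimmed-down stand-in for the embedded JSON. The "payload", "orders", "user" and "ingame_name" keys come from the answer above; the "platinum" price key and the numbers are assumptions for illustration, so check the real JSON for the actual field name.

```python
import json
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page; the "platinum" key is an assumption.
SAMPLE_HTML = """
<html><body>
<script id="application-state" type="application/json">
{"payload": {"orders": [
  {"platinum": 45, "user": {"ingame_name": "Rappei"}},
  {"platinum": 40, "user": {"ingame_name": "KentKoes"}}
]}}
</script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
script_tag = soup.select_one("#application-state")

# The script body is plain JSON, so it can be loaded into a dict directly.
orders = json.loads(script_tag.string)["payload"]["orders"]

# Pick the order with the lowest price.
cheapest = min(orders, key=lambda order: order["platinum"])
print(cheapest["user"]["ingame_name"], cheapest["platinum"])
```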

How to extract data from dynamic website?

I am trying to get the restaurant name and address of each restaurant from this platform:
https://customers.dlivery.live/en/list
So far I tried with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import json
url = 'https://customers.dlivery.live/en/list'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
           'AppleWebKit/537.36 (KHTML, like Gecko) '\
           'Chrome/75.0.3770.80 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
soup
I noticed that soup does not contain the restaurant data.
How can I get it?
If you inspect the page, you will notice that the names are wrapped in elements with the card_heading class, and the addresses in elements with the card_distance class.
soup = BeautifulSoup(response.text, 'html.parser')
restaurantAddress = soup.find_all(class_='card_distance')
for address in restaurantAddress:
    print(address.text)
and
soup = BeautifulSoup(response.text, 'html.parser')
restaurantNames = soup.find_all(class_='card_heading')
for name in restaurantNames:
    print(name.text)
Not sure if this exact code will work, but it is pretty close to what you are looking for.
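Because the question reports that the downloaded HTML contains no restaurant data, it may be worth checking whether a class is present in the response at all before looping over the matches. The helper name count_matches and both HTML snippets below are hypothetical; they only illustrate the difference between a page shell and rendered markup.

```python
from bs4 import BeautifulSoup


def count_matches(html: str, class_name: str) -> int:
    """Return how many elements in the HTML carry the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(class_=class_name))


# Hypothetical shell of a JavaScript-driven page: no cards in the raw HTML.
static_html = '<div id="root"></div><script src="app.js"></script>'

# Hypothetical markup as it might look after the browser has rendered it.
rendered_html = (
    '<div class="card_heading">Pizzeria Uno</div>'
    '<div class="card_distance">1 Main St</div>'
)

print(count_matches(static_html, "card_heading"))
print(count_matches(rendered_html, "card_heading"))
```

If the count is zero for response.text, the elements are most likely added by JavaScript, and the data has to come from a rendered page or from the request the page itself makes.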

Get text from <span class: with Beautifulsoup and requests

So I tried to get a specific text from a website, but it only gives me this error:
floor = soup.find('span', {'class': 'text-white fs-14px text-truncate attribute-value'}).text
AttributeError: 'NoneType' object has no attribute 'text'
I specifically want to get the 'Floor Price' text.
My code:
import requests
from bs4 import BeautifulSoup
#target url
url = "https://magiceden.io/marketplace/solsamo"
#act like browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get('https://magiceden.io/marketplace/solsamo')
#parse the downloaded page
soup = BeautifulSoup(response.content, 'lxml')
floor = soup.find('span', {'class': 'text-white fs-14px text-truncate attribute-value'}).text
print(floor)
The HTML you receive from this request does not contain the data you need:
response = requests.get('https://magiceden.io/marketplace/solsamo')
You can verify this by looking at the page source:
view-source:https://magiceden.io/marketplace/solsamo
Use Selenium instead of requests to get the data, or examine the XHR requests the page makes; you may be able to fetch the data with requests from one of those URLs.
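The AttributeError itself comes from calling .text on the None that is returned when nothing matches. A small guard makes that case explicit. This is only a sketch: text_or_none is a hypothetical helper, and the two HTML snippets and the "1.5 SOL" value are invented for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = soup.select_one(css_selector)
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical page shell, similar to what requests gets from a JS-rendered site.
empty_page = BeautifulSoup('<div id="root"></div>', "html.parser")

# Hypothetical rendered markup; the value is invented.
full_page = BeautifulSoup(
    '<span class="text-white fs-14px text-truncate attribute-value">1.5 SOL</span>',
    "html.parser",
)

print(text_or_none(empty_page, "span.attribute-value"))
print(text_or_none(full_page, "span.attribute-value"))
```

Selecting on a single class with a CSS selector also avoids depending on the exact order of all the classes in the attribute.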

Find specific Tag Python BeautifulSoup

Hey, I'm trying to extract the URLs between two tags.
This is what I've got so far:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = []
for links in soup.findAll('cite'):
    print(links.get('cite'))
I have tried different things, but I couldn't extract the URL between
<cite>.....</cite>
My updated code:
import requests
from bs4 import BeautifulSoup as bs
dorks = input("Keyword : ")
binglist = "http://www.bing.com/search?q="
with open(dorks , mode="r",encoding="utf-8") as my_file:
    for line in my_file:
        clean = binglist + line
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
        r = requests.get(clean, headers=headers)
        soup = bs(r.text, 'html.parser')
        links = soup.find('cite')
        print(links)
The keyword file just needs one keyword per line, for example:
test
games
Thanks for your help
You can do it as follows:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')
for link in links:
    print(link.text)
You can webscrape Bing as follows:
import requests
from bs4 import BeautifulSoup as bs
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
r = requests.get("https://www.bing.com/search?q=test", headers=headers)
soup = bs(r.text, 'html.parser')
links = soup.find_all('cite')
for link in links:
    print(link.text)
This code does the following:
With requests we fetch the web page we're looking for. We set headers to avoid being blocked by Bing (for more information, see: https://oxylabs.io/blog/5-key-http-headers-for-web-scraping)
Then we parse the HTML and extract all cite tags (this returns a list).
For each element in the list we only want what's inside the cite tag, so we print it using .text.
Please pay attention to the headers!
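The updated loop in the question appends each raw line of the keyword file to the search URL, trailing newline included. Below is a minimal sketch of building the URL more safely with the standard library only; build_search_url is a hypothetical helper name, not something from the original code.

```python
from urllib.parse import urlencode


def build_search_url(keyword: str) -> str:
    """Strip surrounding whitespace and URL-encode the keyword."""
    query = urlencode({"q": keyword.strip()})
    return f"https://www.bing.com/search?{query}"


# A line read from a file usually ends with a newline.
print(build_search_url("test\n"))
# Spaces are encoded as '+'.
print(build_search_url("lion king"))
```

Passing params={"q": keyword.strip()} to requests.get has the same effect without building the URL by hand.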
Try this:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')
for link in links:
    print(link.text)
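The difference between find and find_all matters here: find returns only the first matching tag (or None), while find_all returns a list of every match. A short offline sketch, using two made-up example.com URLs:

```python
from bs4 import BeautifulSoup

# Two cite tags with placeholder URLs, for illustration only.
html_doc = (
    "<div><cite>https://example.com/a</cite></div>"
    "<div><cite>https://example.com/b</cite></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# find: a single Tag, the first match.
first_text = soup.find("cite").text

# find_all: a list of Tags, one per match.
all_texts = [cite.text for cite in soup.find_all("cite")]

print(first_text)
print(all_texts)
```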
You're looking for this to get links from Bing organic results:
# container with needed data: title, link, snippet, etc.
for result in soup.select(".b_algo"):
    link = result.select_one("h2 a")["href"]
Specifically, for the example you provided:
from bs4 import BeautifulSoup
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.select_one('.b_attribution cite').text
print(link)
# https://www.developpez.net/forums/d1497343/environnements-developpem...
Full example code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
}
params = {
"q": "lasagna",
"hl": "en",
}
html = requests.get("https://www.bing.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, "lxml")
for links in soup.select(".b_algo"):
    link = links.select_one("h2 a")["href"]
    print(link)
------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''
Alternatively, you can achieve the same thing by using the Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with extraction, maintenance, or bypassing blocks; instead, you only need to iterate over structured JSON and get what you want.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "bing",
    "q": "lasagna"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''
Disclaimer, I work for SerpApi.

Webscraping latitude longitude from google results

How can I scrape the latitude and longitude from the Google results in the image below using Beautiful Soup?
Google result latitude longitude
Here is the code to do it with bs4:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',}
response = get("https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france",headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
a = soup.find("div", class_= "Z0LcW").text
print(a)
Please provide more input on further questions since we don't want to do the pre-work to create a solution.
You will have to grab this container:
<div class="HwtpBd gsrt PZPZlf" data-attrid="kc:/location/location:coordinates" aria-level="3" role="heading"><div class="Z0LcW XcVN5d">48.8573° N, 2.3370° E</div><div></div></div>
# BeautifulSoup approach
import requests
from bs4 import BeautifulSoup
# Make the request
url = "https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france&rlz=1C1CHBF_deDE740DE740&oq=latitude+longitude+of+75270+postal+code+paris+france&aqs=chrome..69i57.4020j0j8&sourceid=chrome&ie=UTF-8"
response = requests.get(url)
# Convert it to proper html
html = response.text
# Parse it in html document
soup = BeautifulSoup(html, 'html.parser')
# Grab the container and its content
target_container = soup.find("div", {"class": "Z0LcW XcVN5d"}).text
This returns the string inside the div,
assuming Google doesn't change the class names randomly. I tried five refreshes and the class name didn't change, but who knows.
Make sure you're sending a User-Agent header (you can also use a Python fake user-agent library).
Code that grabs the location from Google Search results:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=latitude longitude of 75270 postal code paris france',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')
location = soup.select_one('.XcVN5d').text
print(location)
Output:
48.8573° N, 2.3370° E
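The scraped value is a display string, so a follow-up step is usually needed to turn it into numbers. Here is a minimal sketch using only the standard library. It assumes the text keeps the "48.8573° N, 2.3370° E" layout shown above; parse_coordinates is a hypothetical helper name.

```python
import re


def parse_coordinates(text):
    """Convert a string like '48.8573° N, 2.3370° E' to signed floats."""
    pattern = r"([\d.]+)°\s*([NS]),\s*([\d.]+)°\s*([EW])"
    match = re.search(pattern, text)
    if match is None:
        return None  # layout differs from the assumed one
    lat, ns, lon, ew = match.groups()
    # South and West are expressed as negative values.
    latitude = float(lat) * (1 if ns == "N" else -1)
    longitude = float(lon) * (1 if ew == "E" else -1)
    return latitude, longitude


print(parse_coordinates("48.8573° N, 2.3370° E"))
```

Returning None for an unexpected layout keeps the caller from silently working with wrong numbers.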
