Beautifulsoup parsing error

Beautifulsoup parsing error - python

I am trying to extract some information about an App on Google Play and BeautifulSoup doesn't seem to work.
The link is this(say):
https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)
0 #How is this 0?! There is clearly a div with that class
I decided to go all in, didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?

You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
html = r.content
soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()
print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com

Related

how to scrape page inside the result card using Bs4?

<img class="no-img" data-src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium" alt="Biryani By Kilo" data-gatype="RestaurantImageClick" data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178" data-w-onclick="cardClickHandler" src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium">
page url - https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1
this page contains some restaurants card now while scrapping the page in the loop I want to go inside the restaurant card URL which is in the above HTML code name by data-url class and scrape the no. of reviews from inside it, I don't know how to do it my current code for normal front page scrapping is ;
def extract(page):
url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}" # URL of the website
header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
r = requests.get(url, headers=header)
soup = BeautifulSoup(r.content, 'html.parser')
return soup
def transform(soup): # function to scrape the page
divs = soup.find_all('div', class_ = 'restnt-card restaurant')
for item in divs:
title = item.find('a').text.strip() # restaurant name
loc = item.find('div', class_ = 'restnt-loc ellipsis').text.strip() # restaurant location
try: # used this try and except method because some restaurants are unrated and while scrpaping those we would run into an error
rating = item.find('div', class_="img-wrap").text
rating = (re.sub("[^0-9,.]", "", rating))
except:
rating = None
pricce = item.find('span', class_="double-line-ellipsis").text.strip() # price for biriyani
price = re.sub("[^0-9]", "", pricce)[:-1]
biry_del = {
'name': title,
'location': loc,
'rating': rating,
'price': price
}
rest_list.append(biry_del)
rest_list = []
for i in range(1,18):
print(f'getting page, {i}')
c = extract(i)
transform(c)
I hope you guys understood please ask in comment for any confusion.

It's not very fast but it looks like you can get all the details you want including the review count (not 232!) if you hit this backend api endpoint:
https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main
import requests
from bs4 import BeautifulSoup
import pandas as pd
rest_list = []
for page in range(1,3):
print(f'getting page, {page}')
s = requests.Session()
url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}" # URL of the website
header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
r = s.get(url, headers=header)
soup = BeautifulSoup(r.content, 'html.parser')
divs = soup.find_all('div', class_ = 'restnt-card restaurant')
for item in divs:
code = item.find('a')['href'].split('-')[-1] # restaurant code
print(f'Getting details for {code}')
data = s.get(f'https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main').json()
info = data['header']
info.pop('share') #clean up csv
info.pop('options')
rest_list.append(info)
df = pd.DataFrame(rest_list)
df.to_csv('dehli_rest.csv',index=False)

How to get all 'href' with soup in python ? I try so many times but not work

How to get all 'href' with soup in python ? I try so many times but in vain.
Whatever I use 'soup.find' or 'soup.find_all' method to strugle for the 'href', it doesn't work.
python version:3.10
!pip install requests
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup
productlink = []
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}
for page in range(1,2):
url = "https://www.momomall.com.tw/s/103487/dcategory/all/3/{page}"
r = requests.get(url, headers = headers)
Soup = BeautifulSoup(r.text,"lxml")
for link in Soup.find_all('ul',class_="searchItem Stype"):
print(len(link))
Link = link.li.a
LINK = Link.get('href')
print(LINK)
productlink.append(LINK)
print(productlink)

sorry i misunderstood totally your problem. find_all is not a very versatile tool and you were searching for the wrong ul
i barely changed your code but it seems to work now
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup
productlink = []
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}
for page in range(1,2):
url = f"https://www.momomall.com.tw/s/103487/dcategory/all/3/{page}"
r = requests.get(url, headers = headers)
Soup = BeautifulSoup(r.text,"lxml")
for link in Soup.select('ul#surveyContent > li > a[href]:first-of-type'):
print(len(link))
# ~ Link = link.li.a
LINK = link.get('href')
print(LINK)
productlink.append(LINK)
print(productlink)

for page in range(1,2):
url = "https://m.momomall.com.tw/m/store/DCategory.jsp?entp_code=103487&category_code=all&orderby=3&page={}".format(page)
r = requests.get(url,headers = headers)
soup = BeautifulSoup(r.text,'lxml')
for goods_code in soup.select('a.nofollowBtn_star'):
Goods_code = 'https://www.momomall.com.tw/s/103487/'+goods_code.get('goods_code')+'/'
goodlink.append(Goods_code)
for URL in goodlink:
R = requests.get(URL, headers = headers)
Soup = BeautifulSoup(R.text,"lxml")
for dataprice in Soup.select('script'):
import re
discount_regex=re.compile('discountPrice = (\d{1,5})')
print(re.search(discount_regex, dataprice).group(1))
```

I am having some trouble with web scraping using beautifulsoup

when ever i try to extract text between tags using .text() it gives a blank screen with just [] as output
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.amazon.in/s?k=ssd&ref=nb_sb_noss")
soup = BeautifulSoup(page.content, "html.parser")
product = soup.find_all("h2",class_="a-link-normal a-text-normal")
results = soup.find_all("span",class_="a-offscreen")
print(product)
this is the output that i got :
C:\Users\Kushal\Desktop\requests-tutorial>C:/Users/Kushal/AppData/Local/Programs/Python/Python37/python.exe c:/Users/Kushal/Desktop/requests-tutorial/scraper.py
[]
when i try listing everything with a for loop then, nothing shows up not even the empty square brackets

Based on your comment below. I've modified the code to fetch all the product title on the said page along with the price details.
Mark as answer if it works, else comment for further analysis.
import requests
from bs4 import BeautifulSoup
import lxml
dataList = list()
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
"accept-encoding": "gzip,deflate,sdch",
"accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
url = requests.get('https://www.amazon.in/s?k=ssd&ref=nb_sb_noss'.format(), headers=headers)
soup = BeautifulSoup(url.content, 'lxml')
title = soup.find_all('span', attrs={'class':'a-size-medium a-color-base a-text-normal'})
price = soup.find_all('span', attrs={'class':'a-offscreen'})
for product in zip(title,price):
title,price=product
title_proper=title.text.strip()
price_proper=price.text.strip()
print(title_proper,'-',price_proper)

Python web scrape numerical weather data

I am attempting to print the int value of current outside air temperature. (55)
Any chance for a tip on what I am doing wrong? (sorry not a lot of wisdom here!)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt
#this is used at the end with plotting results to current hour
h = dt.datetime.now().hour
r = requests.get(
'https://www.google.com/search?q=weather+duluth')
soup = BeautifulSoup(r.text, 'html.parser')
stuff = []
for item in soup.select('vk_bk sol-tmp'):
item = int(item.contents[1].get_text(strip=True)[:-1])
#print(item)#this is weather data
stuff.append(item)
This is the web URL for weather and the current outdoor temperature is tied to the div class highlighted below.
If I attempt to print stuff I just get an empty list returned.

Adding User-Agent header should give expected result
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://www.google.com/search?q=weather%20duluth', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find("span", {"class": "wob_t"}).text

Scrape table fields from html with specific class

So I want to build a simple scraper for google shopping and I encountered some problems.
This is the html text from my request(to https://www.google.es/shopping/product/7541391777504770249/online) where I'm trying to query the highlighted div class sh-osd__total-price inside the div class sh-osd__offer-row :
My code is currently:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
r = html_soup.findAll('tr', {'class': 'sh-osd__offer-row'}) #Returns empty
print(r)
r = html_soup.findAll('tr', {'class': 'sh-osd__total-price'}) #Returns empty
print(r)
Where both r are empty, beatiful soup doesn't find anything.
Is there any way to find these two div classes with beautiful soup?

You need to add user agent into the headers:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'} #<-- added line
response = get(url, headers=headers) #<--- include here
html_soup = BeautifulSoup(response.text, 'html.parser')
r = html_soup.find_all('tr', {'class': 'sh-osd__offer-row'}) #Returns empty
print(r)
r = html_soup.findAll('tr', {'class': 'sh-osd__total-price'}) #Returns empty
print(r)
But, since it's a <table> tag, you can use pandas (it uses beautifulsoup under the hood), but does the hard work for you. It will return a list of all elements that are <table>s as dataframes
import pandas as pd
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
dfs = pd.read_html(url)
print(dfs[-1])
Output:
print(dfs[-1])
Sellers Seller Rating ... Base Price Total Price
0 One Fragance No rating ... £30.95 +£8.76 delivery £39.71
1 eBay No rating ... £46.81 £46.81
2 Carethy.co.uk No rating ... £34.46 +£3.99 delivery £38.45
3 fruugo.co.uk No rating ... £36.95 +£9.30 delivery £46.25
4 cosmeticsmegastore.com/gb No rating ... £36.95 +£9.30 delivery £46.25
5 Perfumes Club UK No rating ... £30.39 +£5.99 delivery £36.38
[6 rows x 5 columns]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Beautifulsoup parsing error - python

Related

how to scrape page inside the result card using Bs4?

How to get all 'href' with soup in python ? I try so many times but not work

I am having some trouble with web scraping using beautifulsoup

Python web scrape numerical weather data

Scrape table fields from html with specific class

Categories

Resources