Python .strip() function gives error on variable with HTML (BeautifulSoup) - python

This code scrapes Amazon for a product name. I want to strip the whitespace from this variable, which contains HTML:
span = soup.find("span", id="productTitle")
print(span.strip())
but it gives me this error:
Traceback (most recent call last):
  File "C:/Users/avensis/Desktop/Projects/AmazonScraper/Scraper.py", line 17, in <module>
    print(span.strip())
TypeError: 'NoneType' object is not callable
I don't understand why this occurs. Can someone please explain? Here is my full code:
from bs4 import BeautifulSoup
import requests
import html5lib
url = 'https://www.amazon.co.uk/Pingu-PING2573-Mug/dp/B0764468MD/ref=sr_1_11?dchild=1&keywords=pingu&qid=1595849018' \
'&sr=8-11 '
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/84.0.4147.89 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
span = soup.find("span", id="productTitle")
print(span.strip())

I guess this is what you want to do:
from bs4 import BeautifulSoup
import requests
import html5lib
import random
url = 'https://www.amazon.co.uk/Pingu-PING2573-Mug/dp/B0764468MD/ref=sr_1_11?dchild=1&keywords=pingu&qid=1595849018' \
'&sr=8-11 '
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/84.0.4147.89 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
span = soup.find("span", id="productTitle")
print(span.get_text(strip=True))
prints:
Pingu - Mug | 300 ml | Ceramic | Gift Box | 11 x 8.5 x 8.5 cm
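If that is what you are looking for, the missing piece was .get_text(strip=True). The error occurs because span is a BeautifulSoup Tag, not a string: span.strip is treated as a search for a child <strip> tag, which returns None, and calling None raises the TypeError.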

Use .get_text() method:
span.get_text().replace("\n", "")
'Pingu - Mug | 300 ml | Ceramic | Gift Box | 11 x 8.5 x 8.5 cm'
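As a rough, offline illustration of the two answers above, the sketch below parses a small hand-written snippet instead of fetching the Amazon page. The HTML string is made up for the example; only the tag name and id come from the question.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the product page; only the span id matches the question.
html = '<span id="productTitle">\n   Pingu - Mug | 300 ml   \n</span>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="productTitle")

# Attribute access on a Tag looks for a child tag with that name,
# so span.strip is a search for a <strip> element and finds nothing.
print(span.strip)  # None

# get_text() returns a plain string; strip=True trims the whitespace.
title = span.get_text(strip=True)
print(title)
```

On a live page, soup.find() can itself return None when the element is missing, so checking span before calling get_text() is a sensible extra guard.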

Related

Can't get all results from TripAdvisor using Python and BeautifulSoup due to pagination

I am trying to get the links of restaurants, but I can only get the first 30 and not the others.
There are hundreds of restaurants in the Madrid area, but the pagination shows only 30 per page, and the following code gets only those 30:
import re
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup as b
city_name = 'Madrid'
geo_code = '187514'
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com//Restaurants-g{}-{}.html".format(geo_code, city_name), headers=headers
).text
# "links.txt" is a placeholder name; the original snippet never opened f
with open("links.txt", "w") as f:
    for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
        print("https://www.tripadvisor.com.sg/" + link)
        next_link = "https://www.tripadvisor.com.sg/" + link
        f.write('%s\n' % next_link)
I found the solution: I had to add ao followed by the result offset to the URL, like this:
"https://www.tripadvisor.com//Restaurants-g{}-{}-{}.html".format(geo_code, city_name, n_review), headers=headers

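The pagination fix above could be sketched roughly as follows. The helper page_urls and the total of 90 results are hypothetical, the template is the one quoted in the solution, and the exact offset token and its position in a real TripAdvisor URL should be checked in the browser before relying on this.

```python
def page_urls(template, geo_code, city_name, total, page_size=30):
    """Yield one URL per results page, stepping the offset by page_size."""
    for offset in range(0, total, page_size):
        yield template.format(geo_code, city_name, offset)


# Template copied from the solution above; the third slot takes the offset.
template = "https://www.tripadvisor.com//Restaurants-g{}-{}-{}.html"

# total=90 is an assumed figure for illustration only.
urls = list(page_urls(template, "187514", "Madrid", total=90))
for url in urls:
    print(url)
```

Each generated URL can then be fetched and searched with the same re.findall call as in the question.
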
Send POST request in Python

I'm trying to scrape a website in which I need to send a POST request to a form to query data. Here is the code I'm using.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"}
with requests.Session() as s:
    r = s.get('https://data.rabbu.com', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    hidden = soup.find_all("input", {'type':'hidden'})
    payload = {x["name"]: x["value"] for x in hidden}
    payload['search'] = '16101 Tampa Street, Brooksville FL 34604'
    payload['bedrooms'] = '2'
    r = s.post('https://data.rabbu.com/e', headers=headers, data=payload)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(soup.text)
But I'm unable to send the POST request properly, because I'm getting the following error message:
"The change you wanted was rejected (422)"
I tried to use the "json" argument instead of "data" - to no avail.
Do you have any idea how I can bypass this issue? Any help would be appreciated.
Your parameters need to be changed. Try the following:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"}
with requests.Session() as s:
    r = s.get('https://data.rabbu.com', headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    hidden = soup.find_all("input", {'type':'hidden'})
    payload = {x["name"]: x["value"] for x in hidden}
    payload['estimate[address]'] = '16101 Tampa Street, Brooksville FL 34604'
    payload['estimate[bedrooms]'] = '2'
    r = s.post('https://data.rabbu.com/e', headers=headers, params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.title.text)
Giving you:
16101 Tampa St, Brooksville, FL 34604, USA | Revenue Projection: $1,639/mo | 2 to 2bds | 13 comps | Rabbu
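The answer changes two things at once: the field names and the switch from data= to params=. The sketch below only illustrates the second change, by preparing both requests without sending them. It assumes nothing about the server beyond the URL and field names quoted above, and it leaves out the hidden form fields.

```python
import requests

url = "https://data.rabbu.com/e"
payload = {
    "estimate[address]": "16101 Tampa Street, Brooksville FL 34604",
    "estimate[bedrooms]": "2",
}

# prepare() builds the request locally; nothing is sent over the network.
as_params = requests.Request("POST", url, params=payload).prepare()
as_data = requests.Request("POST", url, data=payload).prepare()

# params= puts the fields in the query string and leaves the body empty.
print(as_params.url)
print(as_params.body)  # None

# data= keeps the URL unchanged and form-encodes the fields into the body.
print(as_data.url)
print(as_data.body)
```

Comparing these with the request the browser sends (in the developer tools network tab) is usually the quickest way to see which form a site expects.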

I get an Python BeautifulSoup None Type Error [duplicate]

url = "https://www.imdb.com/chart/top/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,"html.parser")
puan = soup.find_all("tr")
for i in puan:
    puan2 = i.find_all("td",{"class":"ratingColumn"})
    for x in puan2:
        puan3 = x.find("strong")
        print(puan3.text)
I'm scraping with BeautifulSoup. I get an error because some of the results are None. How can I skip the None entries?
Adding a simple if guard will do the trick:
if puan3 is not None:
    print(puan3.text)
What happens?
Your selection is not specific enough, so the result set also contains elements you don't want, including cells with no <strong> tag, for which find() returns None.
How to fix?
Select your elements more specifically:
i.find_all("td",{"class":"imdbRating"})
or
for row in soup.select('table.chart tbody tr'):
    rating = row.select_one('.imdbRating strong').text
    print(rating)
and additionally, with a guard against missing elements:
for row in soup.select('table.chart tbody tr'):
    rating = rating.text if (rating := row.select_one('.imdbRating strong')) else None
    print(rating)
Example (based on your code)
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,"html.parser")
puan = soup.find_all("tr")
for i in puan:
    puan2 = i.find_all("td",{"class":"imdbRating"})
    for x in puan2:
        puan3 = x.find("strong")
        print(puan3.text)
Example (css selectors)
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,"html.parser")
for row in soup.select('table.chart tbody tr'):
    rating = rating.text if (rating := row.select_one('.imdbRating strong')) else None
    print(rating)
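To try the None guard without hitting IMDb, here is a small offline sketch. The table markup is invented for the example and only mimics the class names used in the answer; the helper collect_ratings is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup: each row has one rating cell with <strong> and one without.
html = """
<table class="chart"><tbody>
  <tr><td class="ratingColumn imdbRating"><strong>9.2</strong></td>
      <td class="ratingColumn"><div>Rate</div></td></tr>
  <tr><td class="ratingColumn imdbRating"><strong>9.0</strong></td>
      <td class="ratingColumn"><div>Rate</div></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")


def collect_ratings(soup):
    """Return the rating text of every cell that contains a <strong> tag."""
    ratings = []
    for cell in soup.find_all("td", class_="ratingColumn"):
        strong = cell.find("strong")
        # find() returns None when the cell has no <strong>; skip those cells.
        if strong is not None:
            ratings.append(strong.get_text(strip=True))
    return ratings


ratings = collect_ratings(soup)
print(ratings)
```

Without the guard, the second cell of each row would raise AttributeError, because None has no text attribute.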

How do I modify code to parse multiple URL?

I have this code that gets all child URLs within a page.
How do I parse multiple URLs through this code?
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/england/efl-cup/results/", headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div = soup.find("div", class_="main-menu2 main-menu-gray")
a_tag = main_div.find_all("a")
for i in a_tag:
    print(i['href'])
How do I modify it to run for multiple URLs
when my URL list is as follows?
df:
| | URL |
|----|---------------------------------------------------------------------|
| 0 | https://www.oddsportal.com/soccer/nigeria/npfl-pre-season/results/ |
| 1 | https://www.oddsportal.com/soccer/england/efl-cup/results/ |
| 2 | https://www.oddsportal.com/soccer/europe/guadiana-cup/results/ |
| 3 | https://www.oddsportal.com/soccer/world/kings-cup-thailand/results/ |
| 4 | https://www.oddsportal.com/soccer/poland/division-2-east/results/ |
I tried parsing it this way :
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/91.0.4472.114 Safari/537.36'}
for url in df:
    source = requests.get(df['URL'], headers=headers)
    soup = BeautifulSoup(source.text, 'html.parser')
    main_div = soup.find("div", class_="main-menu2 main-menu-gray")
    a_tag = main_div.find_all("a")
    for i in a_tag:
        print(i['href'])
However I am getting this error:
line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
How can I modify the same to parse multiple URLs?
Change
for url in df:
    source = requests.get(df['URL'], headers=headers)
to
for url in df['URL']:
    source = requests.get(url, headers=headers)
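A short sketch of why the change matters, assuming pandas is installed and using a two-row frame built from URLs in the question. Iterating over a DataFrame yields its column labels, while iterating over a column yields the values.

```python
import pandas as pd

df = pd.DataFrame({"URL": [
    "https://www.oddsportal.com/soccer/england/efl-cup/results/",
    "https://www.oddsportal.com/soccer/europe/guadiana-cup/results/",
]})

# What `for url in df` loops over: the column labels.
column_labels = list(df)
print(column_labels)

# What `for url in df['URL']` loops over: one URL string per row.
urls = df["URL"].tolist()
print(urls)
```

In the original loop, requests.get(df['URL']) was also handed the whole column rather than a single string, which is consistent with the InvalidSchema error shown above.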

Scrape and return a value from within a div class with Python

Any idea how I can retrieve the price (currently 2917.99) from the source of this page: https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/
If I select p.product-new-price, I get None.
I have managed to get the title, but not the price.
What I have done so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('title')
div = soup.find('div', {"class" : 'product-new-price'})
text = div.string
print(text)
The markup looks like this, and I want to extract 2917 as an int.
div class="product-highlight product-page-pricing"
p class="product-new-price"
2.917<sup>99</sup> <span>Lei</span>
Thank you very much!
OK, with minor modifications:
For me, the class product-new-price is on the p element, not on a div.
I am assuming there will always be a <sup> tag after the main price.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('title')
p = soup.find('p', {"class" : 'product-new-price'})
# Get the text before <sup> tag
value = p.find('sup').previous_sibling.strip()
print("Value: {}".format(value))
# Keep only numbers
value = ''.join(c for c in value if c.isdigit())
price = int(value)
print("Price: {}".format(price))
The above prints:
$ python3 ./test.py
Value: 2.917
Price: 2917
Now, with small changes, you can also add the missing .99 if it is required.
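One possible way to keep the .99 is sketched below. It runs offline against the markup quoted in the question, with closing tags added as an assumption; the helper parse_price is hypothetical, and it assumes the decimals always sit in the <sup> tag right after the main price.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Markup from the question; the closing tags are assumed.
html = """
<div class="product-highlight product-page-pricing">
  <p class="product-new-price">
    2.917<sup>99</sup> <span>Lei</span>
  </p>
</div>
"""


def parse_price(soup):
    """Return the price as a Decimal, or None if the expected tags are missing."""
    p = soup.find("p", class_="product-new-price")
    if p is None:
        return None
    sup = p.find("sup")
    if sup is None:
        return None
    # Text just before <sup> holds the whole part, e.g. "2.917"; keep digits only.
    whole = "".join(c for c in sup.previous_sibling if c.isdigit())
    cents = sup.get_text(strip=True)
    return Decimal(f"{whole}.{cents}")


price = parse_price(BeautifulSoup(html, "html.parser"))
print(price)
print(int(price))
```

Decimal is used here instead of float so the two decimal places are kept exactly; int(price) still gives the whole amount asked for in the question.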
