Shortened link not working with BeautifulSoup Python

This code gets the information from the site perfectly fine:
url = 'https://www.vogue.com/article/mamma-mia-2-here-we-go-again-review?mbid=social_twitter'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
title = soup.find("meta", {"name": "twitter:title"})
title2 = soup.find("meta", property="og:title")
title3 = soup.find("meta", property="og:description")
print("TITLE: "+str(title['content']))
print("TITLE2: "+str(title2['content']))
print("TITLE3: "+str(title3['content']))
However, when I replace the url with this shortened link it returns:
print("TITLE: "+str(title['content']))
TypeError: 'NoneType' object has no attribute '__getitem__'

The URL shortener responds with a meta-refresh tag that redirects to the desired page, and requests does not follow that kind of redirect on its own. This code should help:
from bs4 import BeautifulSoup
import requests
import re
shortened_url = '<YOUR SHORTENED URL>'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(shortened_url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
while True:
    # is meta refresh there?
    if soup.select_one('meta[http-equiv=refresh]'):
        refresh_url = re.search(r'url=(.*)', soup.select_one('meta[http-equiv=refresh]')['content'], flags=re.I)[1]
        response = requests.get(refresh_url, headers=headers)
        soup = BeautifulSoup(response.text, "lxml")
    else:
        break
title = soup.find("meta", {"name": "twitter:title"})
title2 = soup.find("meta", property="og:title")
title3 = soup.find("meta", property="og:description")
print("TITLE: "+str(title['content']))
print("TITLE2: "+str(title2['content']))
print("TITLE3: "+str(title3['content']))
Prints:
TITLE: Mamma Mia! Here We Go Again Is the Only Good Thing About This Summer - Vogue
TITLE2: Mamma Mia! Here We Go Again Is the Only Good Thing About This Summer
TITLE3: Is it possible to change your country of origin to a movie sequel?
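The regular expression above takes everything after url= verbatim, so a quoted or relative target would be passed to requests unchanged. Below is a minimal sketch of a stricter parser for the content attribute, using only the standard library. The name parse_meta_refresh is a hypothetical helper, and the example URLs are placeholders rather than the real shortened link.

```python
import re
from urllib.parse import urljoin

# Matches "<delay>; url=<target>" as found in <meta http-equiv="refresh" content="...">.
# The "url=" prefix is treated as optional and case-insensitive (an assumption
# about what shorteners emit, not something taken from the original answer).
_REFRESH_RE = re.compile(r"^\s*\d*\.?\d*\s*[;,]\s*(?:url\s*=\s*)?(.+?)\s*$", re.I)


def parse_meta_refresh(content, base_url):
    """Return the absolute redirect target, or None if content has no target."""
    match = _REFRESH_RE.match(content)
    if match is None:
        return None
    # Some pages wrap the target in quotes; drop them before resolving.
    target = match.group(1).strip("'\"")
    if not target:
        return None
    # Resolve relative targets against the URL of the page that was fetched.
    return urljoin(base_url, target)


if __name__ == "__main__":
    print(parse_meta_refresh("0; url=https://example.com/article", "https://short.example/abc"))
```

Inside the loop, the result could replace refresh_url, with response.url passed as base_url. Capping the number of iterations would also keep two pages that refresh to each other from looping forever.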

Related

how to get the content of a title using BeautifulSoup4 and requests

I have taken the titles of the medicines from this link: Medicines List
Now I want to get the content for every medicine. Each medicine has its own link.
Example:
Medicines Example
How can I get the content of each medicine using the BeautifulSoup4 and requests libraries?
import requests
from bs4 import BeautifulSoup
from pprint import pp
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}
def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    title = [x.text for x in soup.select(
        'a[class$=section__item-link]')]
    count = 0
    for x in range (0, len(title)):
        count += 1
        print("{0}. {1}\n".format(count, title[x]))
main('https://www.klikdokter.com/obat')
Based on what I can see as the response from https://www.klikdokter.com/obat you should be able to do something like this:-
import requests
from bs4 import BeautifulSoup
AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
BASEURL = 'https://www.klikdokter.com/obat'
headers = {'User-Agent': AGENT}
response = requests.get(BASEURL, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
for tag in soup.find_all('a', class_='topics-index--section__item-link'):
    href = tag.get('href')
    if href is not None:
        print(href)
        response = requests.get(href, headers=headers)
        response.raise_for_status()
        """ Do your processing here """

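If the href values on the index page turn out to be relative (an assumption; the answer above requests them as they are), they need to be resolved against the page URL first. Below is a sketch with urllib.parse.urljoin that also skips duplicates. The name absolutize is a hypothetical helper and the sample paths are made up.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, dropping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # tag.get('href') returns None when the attribute is missing
        full_url = urljoin(base_url, href)
        if full_url not in seen:
            seen.add(full_url)
            result.append(full_url)
    return result


if __name__ == "__main__":
    # Made-up hrefs standing in for what soup.find_all(...) would yield.
    sample = ["/obat/example-a", "https://www.klikdokter.com/obat/example-a", None]
    for link in absolutize("https://www.klikdokter.com/obat", sample):
        print(link)
```

The returned list can then be fed to the requests.get loop in place of the raw href values.
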
Why can't I scrape Amazon products by BeautifulSoup?

I am trying to scrape the heading of this Amazon listing. The code I wrote is working for some other Amazon listings, but not working for the url mentioned in the code below.
Here is the python code I've tried:
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}
page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup.prettify())
title = soup.find(id = "productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)
Output:
200
default_title
html code from inspector tools:
<span id="productTitle" class="a-size-large product-title-word-break">
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3
</span>
First, as others have commented, use a proxy service. Second, the ASIN alone is enough to reach an Amazon product page.
Amazon follows this url pattern for all product pages.
https://www.amazon.(com/in/fr)/dp/<asin>
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/dp/B0892SZX7F"
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find("span", {"id":"productTitle"})
if title:
    title = title.get_text(strip=True)
else:
    title = "default_title"
print(title)
Output:
200
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3
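The long listing URL can be reduced to the /dp/<asin> form mechanically. Below is a sketch that assumes the ASIN is the 10-character uppercase alphanumeric segment following /dp/, which holds for the URLs in this thread but is not verified here as a general rule. The name canonical_product_url is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Assumption: the ASIN is the 10-character segment right after "/dp/".
_ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})(?:/|$)")


def canonical_product_url(url):
    """Return scheme://host/dp/<asin> for a product URL, or None if no ASIN is found."""
    parts = urlsplit(url)
    match = _ASIN_RE.search(parts.path)
    if match is None:
        return None
    return f"{parts.scheme}://{parts.netloc}/dp/{match.group(1)}"


if __name__ == "__main__":
    long_url = ("https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour"
                "/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1")
    print(canonical_product_url(long_url))
```

Working on the parsed path rather than the whole URL keeps query parameters from matching by accident.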
This worked fine for me:
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}
http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"
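proxyDict = {
    "http" : http_proxy,
    "https" : https_proxy,
    "ftp" : ftp_proxy
}
# proxyDict is defined but not passed below; add proxies=proxyDict (with working proxy addresses) to use it
page = requests.get(url, headers=headers)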
print(page.status_code)
soup = BeautifulSoup(page.content, "lxml")
#print(soup.prettify())
title = soup.find(id = "productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)

soup.find returning "none" only sometimes?

I am scraping an Amazon product page and using Beautiful Soup to find the product name and price. For some reason, the "title" variable will return sometimes and other times I will get the error, "'NoneType' object has no attribute 'get_text'"
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Lenovo-ThinkPad-i5-10210U-i7-7500U-Wireless/\
dp/B08BYZD4H9/ref=sr_1_2_sspa?dchild=1&keywords=thinkpad&qid=1595377662&sr=8\
-2-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyMVhTU1BOODg5TlgmZW5jcnlwdGVkS\
WQ9QTAzMTc5MDFMNjhGMUE0VlRHT1gmZW5jcnlwdGVkQWRJZD1BMDY3MDc3MzJPQzc2QkI5UlcwSUE\
md2lkZ2V0TmFtZT1zcF9hdGYmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
converted_price = int(price[1:6].replace(',',''))
print(converted_price)
print(title)
Try to specify more HTTP headers, for example User-Agent and Accept-Language. Also, change the parser to lxml or html5lib.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5'
}
URL = 'https://www.amazon.com/Lenovo-ThinkPad-i5-10210U-i7-7500U-Wireless/dp/B08BYZD4H9/ref=sr_1_2_sspa?dchild=1&keywords=thinkpad&qid=1595377662&sr=8-2-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyMVhTU1BOODg5TlgmZW5jcnlwdGVkSWQ9QTAzMTc5MDFMNjhGMUE0VlRHT1gmZW5jcnlwdGVkQWRJZD1BMDY3MDc3MzJPQzc2QkI5UlcwSUEmd2lkZ2V0TmFtZT1zcF9hdGYmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml') # <-- change to `lxml` or `html5lib`
title = soup.find(id="productTitle").get_text(strip=True)
price = soup.find(id="priceblock_ourprice").get_text(strip=True)
converted_price = int(price[1:6].replace(',',''))
print(converted_price)
print(title)
Prints (in my testing always):
1049
2020 Lenovo ThinkPad E15 15.6 Inch FHD 1080P Laptop| Intel 4-Core i5-10210U (Beats i7-7500U)| 16GB RAM| 1TB SSD (Boot) + 500GB HDD| FP Reader| Win10 Pro+ NexiGo Wireless Mouse Bundle
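The slice price[1:6] assumes a one-character currency symbol and a price of fixed width, so a shorter or longer price would be cut in the wrong place. Below is a sketch of a less position-dependent conversion. It assumes commas are thousands separators and the dot is the decimal mark, and parse_price is a hypothetical helper.

```python
import re
from decimal import Decimal

# First run of digits, optionally with comma separators and a decimal part.
_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a Decimal, or None if there is none."""
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(0).replace(",", ""))


if __name__ == "__main__":
    print(parse_price("$1,049.00"))
```

Calling int() on the result reproduces the whole-number value the original code prints.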
You are getting the error 'NoneType' object has no attribute 'get_text' because the page's content changes between requests, and sometimes there is no element with id="productTitle" or id="priceblock_ourprice".
Add some debug statements like these to see exactly why the error occurs:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
title_soup = soup.find(id="productTitle")
print(title_soup) # <- this might print None
print(title_soup.get_text())
price_soup = soup.find(id="priceblock_ourprice")
print(price_soup) # <- this might print None
print(price_soup.get_text())
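Once the debug output confirms that find() sometimes returns None, the lookup can be guarded rather than left to raise. Below is a sketch of such a guard. The name text_or_default is a hypothetical helper, and FakeTag is a stand-in for a BeautifulSoup tag so that the example runs without a network request or bs4 installed.

```python
def text_or_default(node, default=None):
    """Return the stripped text of a tag-like object, or default when node is None."""
    if node is None:
        return default
    return node.get_text(strip=True)


class FakeTag:
    """Hypothetical stand-in that mimics the get_text(strip=...) call of a bs4 tag."""

    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text


if __name__ == "__main__":
    print(text_or_default(FakeTag("  Example product  "), "default_title"))
    print(text_or_default(None, "default_title"))
```

With a real page this would be called as text_or_default(soup.find(id="productTitle"), "default_title"). A default coming back signals that the expected element was missing from that response.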

Finding appropriate text

<div class="product-name">
CLR2811
</div>
I want to scrape this product name. My code:
ProductTitle = page_soup.find("div",attrs = {'class':'product-name'})
This should return the right thing, i.e. CLR2811, but when I print ProductTitle it returns:
<div class="product-name">
</div>
Only the name is missing.
URL = http://www.coolline-group.com/product-details.php?pid=5a3c8ac755d2f
As @AlexDotis pointed out, you need to use the element's text attribute:
from bs4 import BeautifulSoup
import requests
headers = requests.utils.default_headers()
headers.update({ 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
url = "http://www.coolline-group.com/product-details.php?pid=5a3c8ac755d2f"
req = requests.get(url, headers=headers)  # headers must be passed by keyword; the second positional argument is params
soup = BeautifulSoup(req.content, 'html.parser')
name = soup.find("div",attrs = {'class':'product-name'})
print (name.text.strip())
Output:
CLR2811

BeautifulSoup not finding meta tag information

All three titles return 'None'. However, when I view the page source, I can clearly see that twitter:title, og:title and og:description exist.
url = 'https://www.vox.com/culture/2018/8/3/17644464/christopher-robin-review-pooh-bear-winnie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
title = soup.find("meta", property="twitter:title")
title2 = soup.find("meta", property="og:title")
title3 = soup.find("meta", property="og:description")
print("TITLE: "+str(title))
print("TITLE2: "+str(title2))
print("TITLE3: "+str(title3))
soup.find("meta", property="twitter:title") must be soup.find("meta", {"name": "twitter:title"}) (it's a name, not a property). The other two lines work fine for me.
You need to specify User-Agent in headers, also twitter:title is in name attribute:
from bs4 import BeautifulSoup
import requests
url = 'https://www.vox.com/culture/2018/8/3/17644464/christopher-robin-review-pooh-bear-winnie'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# attribute values containing a colon are quoted so the CSS selector parses
title1 = soup.select_one('meta[name="twitter:title"]')['content']
title2 = soup.select_one('meta[property="og:title"]')['content']
title3 = soup.select_one('meta[property="og:description"]')['content']
print("TITLE1: "+str(title1))
print("TITLE2: "+str(title2))
print("TITLE3: "+str(title3))
Prints:
TITLE1: Christopher Robin is a corporate cash-in, but it fakes sincerity better than most
TITLE2: Christopher Robin is a corporate cash-in, but it fakes sincerity better than most
TITLE3: Winnie the Pooh and pals return to give their old friend a pep talk in a movie overshadowed by the company that made it.
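To see at a glance which attribute each tag uses, every meta tag can be listed together with the attribute that carries its key. Below is a sketch that uses the standard library's html.parser rather than BeautifulSoup and runs on a made-up HTML snippet, not the real page. MetaCollector and collect_meta are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags as {key: (attribute_holding_key, content)}."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        # A meta tag may carry its key in either "name" or "property".
        for key_attr in ("name", "property"):
            key = attributes.get(key_attr)
            if key and content is not None:
                self.found.setdefault(key, (key_attr, content))


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    # Made-up snippet for illustration only.
    sample = ('<head><meta name="twitter:title" content="Example title">'
              '<meta property="og:title" content="Example title"/></head>')
    for key, (attr, content) in collect_meta(sample).items():
        print(f"{key}: found in '{attr}' -> {content}")
```

Knowing which attribute holds each key tells you whether to search by name or by property in the BeautifulSoup call.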
