<div class="product-name">
CLR2811
</div>
I want to scrape this product name. My code:
ProductTitle = page_soup.find("div",attrs = {'class':'product-name'})
This should return the right thing, i.e. CLR2811, but when I print ProductTitle it returns:
<div class="product-name">
</div>
Just the name is missing
URL = http://www.coolline-group.com/product-details.php?pid=5a3c8ac755d2f
As @AlexDotis pointed out, you need to use the element's text attribute:
from bs4 import BeautifulSoup
import requests
headers = requests.utils.default_headers()
headers.update({ 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
url = "http://www.coolline-group.com/product-details.php?pid=5a3c8ac755d2f"
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
name = soup.find("div", attrs={'class': 'product-name'})
print(name.text.strip())
Output:
CLR2811
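Equivalently, get_text(strip=True) does the whitespace stripping in one call on the same tag:
print(name.get_text(strip=True))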
I am trying to grab some data from a URL. My code does not work; it produces an error.
import requests
from bs4 import BeautifulSoup as bs
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"}
r = requests.get('https://bscscan.com/tx/0x6cf01dda40e854b47117bb51c8d2148c6ae851e7d79d792867d1fa66d5ba6bad', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
test = soup.find('div', class_='rawtab').get_text()
print (test)
Wanted Output:
Function: lockTokens(address _token, address _withdrawer, uint256 _amount, uint256 _unlockTimestamp) ***
MethodID: 0x7d533c1e
[0]: 000000000000000000000000109fddb8d49ae139449cb061883b685827b23937
[1]: 000000000000000000000000d824567e3821e99e90930a84d3e320a0fdb6f867
[2]: 00000000000000000000000000000000000000000000a768da96dc82245ed943
[3]: 00000000000000000000000000000000000000000000000000000000628d8187
Current Output:
AttributeError: 'NoneType' object has no attribute 'get_text'
You're close to your goal, but you have to switch the attribute you select by.
<div data-target-group="inputDataGroup" id="rawtab">
The <div> does not have a class; it is the id you should select by, and more specifically its textarea:
soup.find('div', id='rawtab').textarea.text
or with css selector:
soup.select_one('div#rawtab textarea').text
Example
import requests
from bs4 import BeautifulSoup as bs
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"}
r = requests.get('https://bscscan.com/tx/0x6cf01dda40e854b47117bb51c8d2148c6ae851e7d79d792867d1fa66d5ba6bad', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
test = soup.select_one('div#rawtab textarea').text
print (test)
Output
Function: lockTokens(address _token, address _withdrawer, uint256 _amount, uint256 _unlockTimestamp) ***
MethodID: 0x7d533c1e
[0]: 000000000000000000000000109fddb8d49ae139449cb061883b685827b23937
[1]: 000000000000000000000000d824567e3821e99e90930a84d3e320a0fdb6f867
[2]: 00000000000000000000000000000000000000000000a768da96dc82245ed943
[3]: 00000000000000000000000000000000000000000000000000000000628d8187
Getting the items as a list:
textArea = soup.select_one('div#rawtab textarea').text.split('\r\n')
data = [n.split()[-1] for n in textArea if ']:' in n]
Output data (you can pick by index):
['000000000000000000000000109fddb8d49ae139449cb061883b685827b23937',
'000000000000000000000000d824567e3821e99e90930a84d3e320a0fdb6f867',
'00000000000000000000000000000000000000000000a768da96dc82245ed943',
'00000000000000000000000000000000000000000000000000000000628d8187']
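The list keeps the parameter order shown in the raw tab, so you can pick values by index; a quick illustration (the variable names are just mine):
token_word = data[0]   # lines up with the _token parameter
unlock_word = data[3]  # lines up with _unlockTimestamp, still a padded hex word
print(token_word, unlock_word)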
So I have taken the titles of the medicines from this link: Medicines List
Now I want to get the content for every medicine; each medicine has its own link.
Example :
Medicines Example
How can I get the content of each medicine using the BeautifulSoup4 and requests libraries?
import requests
from bs4 import BeautifulSoup
from pprint import pp
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}
def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    title = [x.text for x in soup.select('a[class$=section__item-link]')]
    count = 0
    for x in range(0, len(title)):
        count += 1
        print("{0}. {1}\n".format(count, title[x]))

main('https://www.klikdokter.com/obat')
Based on what I can see in the response from https://www.klikdokter.com/obat you should be able to do something like this:
import requests
from bs4 import BeautifulSoup
AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
BASEURL = 'https://www.klikdokter.com/obat'
headers = {'User-Agent': AGENT}
response = requests.get(BASEURL, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
for tag in soup.find_all('a', class_='topics-index--section__item-link'):
    href = tag.get('href')
    if href is not None:
        print(href)
        response = requests.get(href, headers=headers)
        response.raise_for_status()
        """ Do your processing here """
I am trying to scrape the heading of this Amazon listing. The code I wrote works for some other Amazon listings, but not for the URL in the code below.
Here is the python code I've tried:
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}
page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup.prettify())
title = soup.find(id="productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)
Output:
200
default_title
html code from inspector tools:
<span id="productTitle" class="a-size-large product-title-word-break">
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3
</span>
First, as others have commented, use a proxy service. Second, to reach an Amazon product page, the ASIN alone is enough.
Amazon follows this URL pattern for all product pages:
https://www.amazon.(com/in/fr)/dp/<asin>
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/dp/B0892SZX7F"
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find("span", {"id":"productTitle"})
if title:
    title = title.get_text(strip=True)
else:
    title = "default_title"
print(title)
Output:
200
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3
This worked fine for me:
import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}
http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy
}
# Pass the proxies explicitly, otherwise the dict above is never used
# (the 10.10.1.x addresses are placeholders - substitute a real proxy)
page = requests.get(url, headers=headers, proxies=proxyDict)
print(page.status_code)
soup = BeautifulSoup(page.content, "lxml")
#print(soup.prettify())
title = soup.find(id="productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)
Help me please! I wrote a simple parser, but it does not work correctly, and I do not know why.
import requests
from bs4 import BeautifulSoup
URL = 'https://stopgame.ru//topgames'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0', 'accept': '*/*'}
HOST = 'https://stopgame.ru'
def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('a', class_="lent-block game-block")
    print(items)

def parse():
    html = get_html(URL)
    if html.status_code == 200:
        items = get_content(html.text)
    else:
        print('Error')

parse()
I get this output:
[]
Process finished with exit code 0
items = soup.find_all('a', class_="lent-block game-block")
You are trying to find anchor (a) tags with the class "lent-block game-block", which does not exist in the HTML, hence the empty list.
Try this div selector instead; you will get the list of matched items:
items = soup.find_all('div', class_="lent-block lent-main")
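From there, a minimal sketch of walking the matched blocks. The inner a tag and its href are assumptions about the markup, so check the page and adjust:
for item in items:
    link = item.find('a')  # assumed: each block contains an anchor with the game link
    if link is not None:
        print(link.get('href'), link.get_text(strip=True))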
This code gets the information from the site perfectly fine:
import requests
from bs4 import BeautifulSoup

url = 'https://www.vogue.com/article/mamma-mia-2-here-we-go-again-review?mbid=social_twitter'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
title = soup.find("meta", {"name": "twitter:title"})
title2 = soup.find("meta", property="og:title")
title3 = soup.find("meta", property="og:description")
print("TITLE: "+str(title['content']))
print("TITLE2: "+str(title2['content']))
print("TITLE3: "+str(title3['content']))
However, when I replace the URL with this shortened link, it returns:
print("TITLE: "+str(title['content']))
TypeError: 'NoneType' object has no attribute '__getitem__'
The URL shortener sends a meta refresh that redirects to the desired page. This code should help:
from bs4 import BeautifulSoup
import requests
import re
shortened_url = '<YOUR SHORTENED URL>'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(shortened_url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
while True:
    # is a meta refresh there?
    if soup.select_one('meta[http-equiv=refresh]'):
        refresh_url = re.search(r'url=(.*)', soup.select_one('meta[http-equiv=refresh]')['content'], flags=re.I)[1]
        response = requests.get(refresh_url, headers=headers)
        soup = BeautifulSoup(response.text, "lxml")
    else:
        break
title = soup.find("meta", {"name": "twitter:title"})
title2 = soup.find("meta", property="og:title")
title3 = soup.find("meta", property="og:description")
print("TITLE: "+str(title['content']))
print("TITLE2: "+str(title2['content']))
print("TITLE3: "+str(title3['content']))
Prints:
TITLE: Mamma Mia! Here We Go Again Is the Only Good Thing About This Summer - Vogue
TITLE2: Mamma Mia! Here We Go Again Is the Only Good Thing About This Summer
TITLE3: Is it possible to change your country of origin to a movie sequel?