My question is: when I print out the links list, it prints a perfectly good list in the terminal, so I don't understand why calling find on its items fails.
Moreover, this same code was working in my teacher's IDE.
import requests
from bs4 import BeautifulSoup

param = {'s': 'zombie'}
r = requests.get('http://chilltime.pk/search', params=param)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find('tbody')
links = soup.findAll('td')
for i in links:
    item_text = i.find('a').text
    item_href = i.find('a').attrs['href']
    if item_text and item_href:
        print(item_text)
        print(item_href)
ERROR:

Traceback (most recent call last):
  File "C:/Users/AFFAN ULHAQ/PycharmProjects/Beautiful/bsp.py", line 19, in <module>
    item_text = i.find('a').text
AttributeError: 'NoneType' object has no attribute 'text'
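One way to avoid the None results entirely is to select the anchor tags directly instead of going through each td cell, and to send a User-Agent header in case the site serves different markup to script-like clients: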
import requests
from bs4 import BeautifulSoup

params = {
    's': 'zombie'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

def main(url):
    r = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    target = soup.findAll("a", href=True)
    for tar in target:
        print(tar.text, tar['href'])

main("http://chilltime.pk/search")
Most likely, one of the i elements you iterate over in links does not contain an <a> tag; that is, there isn't a link inside that HTML cell. You can check whether the link really exists before using it:
for i in links:
    item_text = i.find('a').text if i.find('a') else False
    item_href = i.find('a').attrs['href'] if i.find('a') else False
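To avoid calling i.find('a') twice per cell, you could also look the anchor up once and reuse it; a minimal sketch of the same idea:

for i in links:
    a = i.find('a')            # None when the cell holds no link
    if a is None:
        continue               # skip cells without an anchor
    print(a.text)
    print(a.get('href', ''))   # Tag.get() avoids a KeyError for a missing href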
I'm trying to scrape an ecommerce store but getting AttributeError: 'NoneType' object has no attribute 'get_text'. This happens whenever I try to iterate through the products via each product link. I'm not sure whether I'm running into JavaScript rendering, a captcha, or something else. Here's my code:
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.jumia.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

productlinks = []

for x in range(1, 51):
    r = requests.get(f'https://www.jumia.com.ng/ios-phones/?page={x}#catalog-listing/')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('article', class_='prd _fb col c-prd')
    for product in productlist:
        for link in product.find_all('a', href=True):
            productlinks.append(baseurl + link['href'])

for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
    amount = soup.find('span', class_='-b -ltr -tal -fs24').get_text(strip=True)
    review = soup.find('div', class_='stars _s _al').get_text(strip=True)
    rating = soup.find('a', class_='-plxs _more').get_text(strip=True)
    features = soup.find_all('li', attrs={'style': 'box-sizing: border-box; padding: 0px; margin: 0px;'})
    a = features[0].get_text(strip=True)
    b = features[1].get_text(strip=True)
    c = features[2].get_text(strip=True)
    d = features[3].get_text(strip=True)
    e = features[4].get_text(strip=True)
    f = features[5].get_text(strip=True)

    print(f"Name: {name}")
    print(f"Amount: {amount}")
    print(f"Review: {review}")
    print(f"Rating: {rating}")
    print('Key Features')
    print(f"a: {a}")
    print(f"b: {b}")
    print(f"c: {c}")
    print(f"d: {d}")
    print(f"e: {e}")
    print(f"f: {f}")
    print('')
Here's the error message:

Traceback (most recent call last):
  File "c:\Users\LP\Documents\jumia\jumia.py", line 32, in <module>
    name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'get_text'
PS C:\Users\LP\Documents\jumia>
Change the variable baseurl to https://www.jumia.com.ng and change the features variable to features = soup.find('article', class_='col8 -pvs').find_all('li'). After fixing those two issues, you'll probably get an IndexError because not every page has six features listed. You can use something like the following code to iterate through the features and print them:
for i, feature in enumerate(features):
    print(chr(ord("a")+i) + ":", feature.get_text(strip=True))
With this for loop, you don't need the a to f variables. The chr(ord("a")+i) part gets the letter corresponding to index i. However, if there are more than 26 features, this will print punctuation characters or garbage; that is trivially fixed by breaking out of the loop when i > 25. (This trick only works on ASCII systems, not EBCDIC ones.)
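For example, chr(ord("a")+0) is 'a' and chr(ord("a")+2) is 'c'. An alternative that sidesteps the encoding caveat entirely is to pair the features with string.ascii_lowercase (a sketch, not part of the original answer):

import string

# zip() stops at the shorter sequence, so at most 26 features are printed
# and no explicit break is needed
for letter, feature in zip(string.ascii_lowercase, features):
    print(letter + ":", feature.get_text(strip=True))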
Even after making these three changes, there was an AttributeError when the script tried to scrape a link to a product unrelated to iPhones, which showed up on page 5 of the results. I don't know how the script got that link; it was a medicinal cream. To fix that, either wrap the body of the second for loop in a try/except like the following, or put the last line of the first for loop under an if 'iphone' in link['href'] check.
for link in productlinks:
    try:
        # body of the second for loop goes here
        ...
    except AttributeError:
        continue
With these changes, the script would look like this:
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.jumia.com.ng'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

productlinks = []

for x in range(1, 51):
    r = requests.get(f'https://www.jumia.com.ng/ios-phones/?page={x}#catalog-listing/')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('article', class_='prd _fb col c-prd')
    for product in productlist:
        for link in product.find_all('a', href=True):
            if 'iphone' in link['href']:
                productlinks.append(baseurl + link['href'])

for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    try:
        name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
        amount = soup.find('span', class_='-b -ltr -tal -fs24').get_text(strip=True)
        review = soup.find('div', class_='stars _s _al').get_text(strip=True)
        rating = soup.find('a', class_='-plxs _more').get_text(strip=True)
        features = soup.find('article', class_='col8 -pvs').find_all('li')
        print(f"Name: {name}")
        print(f"Amount: {amount}")
        print(f"Review: {review}")
        print(f"Rating: {rating}")
        print('Key Features')
        for i, feature in enumerate(features):
            if i > 25:  # we ran out of letters
                break
            print(chr(ord("a")+i) + ":", feature.get_text(strip=True))
        print('')
    except AttributeError:
        continue
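A side note on the try/except: catching AttributeError around the whole block means one missing field skips the entire product. If you would rather keep partial results, a small helper (illustrative, not part of the original answer) can make each lookup safe on its own:

def text_or_none(tag, strip=True):
    # Call get_text() only when find() actually matched something
    return tag.get_text(strip=strip) if tag is not None else None

# same selectors as above, but a missing field yields None instead of raising
name = text_or_none(soup.find('h1', class_='-fs20 -pts -pbxs'))
amount = text_or_none(soup.find('span', class_='-b -ltr -tal -fs24'))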
I am trying to explore web scraping in Python; currently I am working with Beautiful Soup. I was trying to get the names of the festivals from this site: https://www.skiddle.com/festivals. Everything was going pretty well except for one page, this one: https://www.skiddle.com/festivals/front-end-data-test/. It says 'NoneType' object has no attribute 'find'. Is there any way I can get the data from there?
Here is the code:
import requests
from bs4 import BeautifulSoup
import lxml
import json

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 OPR/89.0.4447.64"
}

# collect all fests URLs
fests_urls_list = []

# for i in range(0, 120, 24):
for i in range(0, 24, 24):
    url = f"https://www.skiddle.com/festivals/search/?ajaxing=1&sort=0&fest_name=&from_date=15%20Aug%202022&to_date=&maxprice=500&o={i}&bannertitle=August"
    req = requests.get(url=url, headers=headers)
    json_data = json.loads(req.text)
    html_response = json_data["html"]
    with open(f"data/index_{i}.html", "w", encoding="utf-8") as file:
        file.write(html_response)
    with open(f"data/index_{i}.html", "r", encoding="utf-8") as file:
        src = file.read()
    soup = BeautifulSoup(src, "lxml")
    cards = soup.find_all("a", class_="card-details-link")
    for item in cards:
        fest_url = "https://www.skiddle.com" + item.get("href")
        fests_urls_list.append(fest_url)

# collect fest info
for url in fests_urls_list:
    req = requests.get(url=url, headers=headers)
    try:
        soup = BeautifulSoup(req.text, "lxml")
        fest_name = soup.find("div", class_="MuiContainer-root MuiContainer-maxWidthFalse css-1krljt2").find("h1").text.strip()
        fest_data = soup.find("div", class_="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-11 css-twt0ol").text.strip()
        print(fest_data)
    except Exception as ex:
        print(ex)
        print("This was not supposed to happen")
Here's a bit of HTML from a web page:
<bg-quote class="value negative" field="Last" format="0,0.00" channel="/zigman2/quotes/203558040/composite,/zigman2/quotes/203558040/lastsale" data-last-stamp="1624625999626" data-last-raw="671.68">671.68</bg-quote>
So I want to get the value of the attribute "data-last-raw", but the find() method seems to return None when searching for this element. Why is this, and how can I fix it?
My code and traceback are below:
import requests
from bs4 import BeautifulSoup as BS
import tkinter as tk

class Scraping:

    @classmethod
    def get_to_site(cls, stock_name):
        sitename = 'https://www.marketwatch.com/investing/stock/' + stock_name
        site = requests.get(sitename, headers={
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7",
            "Connection": "keep-alive",
            "Host": "www.marketwatch.com",
            "Referer": "https://www.marketwatch.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
        })
        print(site.status_code)
        src = site.content
        Scraping.get_price(src)

    @classmethod
    def get_price(cls, src):
        soup = BS(src, "html.parser")
        price_holder = soup.find("bg-quote", {"channel": "/zigman2/quotes/203558040/composite,/zigman2/quotes/203558040/lastsale"})
        price = price_holder["data-last-raw"]
        print(price)

Scraping.get_to_site('tsla')
200
Traceback (most recent call last):
  File "c:\Users\Aatu\Documents\python\pythonleikit\stock_price_scraper.py", line 41, in <module>
    Scraping.get_to_site('tsla')
  File "c:\Users\Aatu\Documents\python\pythonleikit\stock_price_scraper.py", line 30, in get_to_site
    Scraping.get_price(src)
  File "c:\Users\Aatu\Documents\python\pythonleikit\stock_price_scraper.py", line 36, in get_price
    price = price_holder["data-last-raw"]
TypeError: 'NoneType' object is not subscriptable
So site.status_code returns 200, indicating that the site was opened correctly, but soup.find() returns None, indicating that the element I was looking for was not found. Can somebody please help?
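The answer below takes a different route: instead of matching the bg-quote element by its long channel attribute, it grabs it with a CSS selector via select_one: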
import requests
from bs4 import BeautifulSoup

def main(ticker):
    r = requests.get(f'https://www.marketwatch.com/investing/stock/{ticker}')
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.select_one('bg-quote.value:nth-child(2)').text)

if __name__ == "__main__":
    main('tsla')
Output:
670.99
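Note that select_one likewise returns None when nothing matches, so a layout change would raise the same kind of error. A guarded variant of the same lookup (an illustrative sketch) would be:

quote = soup.select_one('bg-quote.value:nth-child(2)')
if quote is None:
    # the selector matched nothing; the page markup may have changed
    print('Price element not found.')
else:
    print(quote.text)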
I've created a script in Python making use of POST HTTP requests to get the search results from a webpage. To populate the results, it is necessary to click on the fields sequentially, as shown here. A new page then appears, and that is how the results get populated.
There are ten results on the first page, and the following script can parse them flawlessly.
What I wish to do now is use the results to reach their inner pages in order to parse the Sole Proprietorship Name (English) from there.
website address
I've tried so far with:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"

payload = {
    'QueryString': '0',
    'SourceAppCode': 'cambodia-br-soleproprietorships',
    'OriginalVersionIdentifier': '',
    '_CBASYNCUPDATE_': 'true',
    '_CBHTMLFRAG_': 'true',
    '_CBNAME_': 'buttonPush'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = res.url.split("&")[0].replace("view.", "update.")
    node = re.findall(r"nodeW\d.+?-Advanced", res.text)[0].strip()
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload[node] = 'N'
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[2]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'", res.text)[0].strip()
    res = s.post(target_url, data=payload)
    soup = BeautifulSoup(res.content, 'html.parser')
    for item in soup.find_all("span", class_="appReceiveFocus")[3:]:
        print(item.text)
How can I parse the Name (English) from each result's inner page using requests?
This is one of the ways you can parse the name from the site's inner page, and then the email address from the address tab. I added the get_email() function only because I wanted to show how you can parse content from different tabs.
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"
result_url = "https://www.businessregistration.moc.gov.kh/cambodia-master/viewInstance/update.html?id={}"
base_url = "https://www.businessregistration.moc.gov.kh/cambodia-br-soleproprietorships/viewInstance/update.html?id={}"

def get_names(s):
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = result_url.format(res.url.split("id=")[1])
    soup = BeautifulSoup(res.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['QueryString'] = 'a'
    payload['SourceAppCode'] = 'cambodia-br-soleproprietorships'
    payload['_CBNAME_'] = 'buttonPush'
    payload['_CBHTMLFRAG_'] = 'true'
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[-1]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'", res.text)[0].strip()
    res = s.post(target_url, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    payload.pop('_CBHTMLFRAGNODEID_')
    payload.pop('_CBHTMLFRAG_')
    payload.pop('_CBHTMLFRAGID_')
    for item in soup.select("a[class*='ItemBox-resultLeft-viewMenu']"):
        payload['_CBNAME_'] = 'invokeMenuCb'
        payload['_CBVALUE_'] = ''
        payload['_CBNODE_'] = item['id'].replace('node', '')
        res = s.post(target_url, data=payload)
        soup = BeautifulSoup(res.text, 'lxml')
        address_url = base_url.format(res.url.split("id=")[1])
        node_id = re.findall(r"taba(.*)_", soup.select_one("a[aria-label='Addresses']")['id'])[0]
        payload['_CBNODE_'] = node_id
        payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
        payload['_CBNAME_'] = 'tabSelect'
        payload['_CBVALUE_'] = '1'
        eng_name = soup.select_one(".appCompanyName + .appAttrValue").get_text()
        yield from get_email(s, eng_name, address_url, payload)

def get_email(s, eng_name, url, payload):
    res = s.post(url, data=payload)
    soup = BeautifulSoup(res.text, 'lxml')
    email = soup.select_one(".EntityEmailAddresses:contains('Email') .appAttrValue").get_text()
    yield eng_name, email

if __name__ == '__main__':
    with requests.Session() as s:
        for item in get_names(s):
            print(item)
The output looks like:

('AMY GEMS', 'amy.n.company@gmail.com')
('AHARATHAN LIN LIANJIN FOOD FLAVOR', 'skykoko344@gmail.com')
('AMETHYST DIAMOND KTV', 'twobrotherktv@gmail.com')
To get just the Name (English) in the original script, you can simply replace print(item.text) with print(item.text.split('/')[1].split('(')[0].strip()), which prints AMY GEMS.
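To see why that chain of splits works, suppose item.text looked like 'KHMER NAME / AMY GEMS (KH123)' (a made-up value; only the slash-then-parenthesis shape matters):

text = 'KHMER NAME / AMY GEMS (KH123)'  # hypothetical example value
part = text.split('/')[1]               # ' AMY GEMS (KH123)' - keep what follows the slash
part = part.split('(')[0]               # ' AMY GEMS ' - drop the parenthesized suffix
print(part.strip())                     # AMY GEMS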
Help me, please! I wrote a simple parser, but it does not work correctly and I cannot figure out why.
import requests
from bs4 import BeautifulSoup

URL = 'https://stopgame.ru//topgames'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0', 'accept': '*/*'}
HOST = 'https://stopgame.ru'

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('a', class_="lent-block game-block")
    print(items)

def parse():
    html = get_html(URL)
    if html.status_code == 200:
        items = get_content(html.text)
    else:
        print('Error')

parse()
I've got this output:
[]
Process finished with exit code 0
items = soup.find_all('a', class_="lent-block game-block")
You are trying to find the class "lent-block game-block" on an anchor tag, which is not actually present in the HTML, and hence you are getting an empty list.
Try this div element instead and you will get the list of matched items:
items = soup.find_all('div', class_="lent-block lent-main")
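Building on that, get_content could pull the link out of each matched block. This is a sketch that assumes each such div wraps an anchor with a relative href (not verified against the live page):

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="lent-block lent-main")
    for item in items:
        link = item.find('a')               # assumed: one anchor per block
        if link is not None and link.get('href'):
            print(HOST + link['href'])      # HOST is defined in the question's code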