I always run into the same problem when scraping a web page:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Can anyone tell me how to solve this? My code is below:
import requests
r = requests.get('https://www.example.com')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
records = []
for result in results:
    name = results.find('div', attrs={'class':'name'}).text
    price = results.find('div', attrs={'class':'price'}).text[13:-11]
    records.append((name, price,))
I want to ask a closely related question. If I want to scrape multiple pages whose URLs follow the pattern in the code below, I use this code, but it still scrapes only the first page. Can you solve this issue?
import requests
for i in range(100):
    url = "https://www.example.com/a/a_{}".format(i)
    r = requests.get(url)
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
Try this. You mixed up results with result:
import requests
r = requests.get('https://www.example.com')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
records = []
for result in results:
    name = result.find('div', attrs={'class':'name'}).text  # result, not results
    price = result.find('div', attrs={'class':'price'}).text[13:-11]
    records.append((name, price,))
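Continuing from that loop: if some product blocks happen to be missing a name or price div, result.find() returns None and .text raises a similar AttributeError. A slightly defensive sketch (assuming the same class names) skips incomplete blocks:

for result in results:
    name_el = result.find('div', attrs={'class': 'name'})
    price_el = result.find('div', attrs={'class': 'price'})
    if name_el and price_el:  # skip blocks missing either field
        records.append((name_el.text, price_el.text[13:-11]))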
Try this: remove the 's' from 'results', in particular in the name = results line.
Your erroneous line: name = results.find('div', attrs={'class':'name'}).text
With one change it becomes: name = result.find('div', attrs={'class':'name'}).text
Well, nice try! Here is the corrected code in full:
import requests
r = requests.get('https://www.example.com')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
records = []
for result in results:
    name = result.find('div', attrs={'class':'name'}).text
    price = result.find('div', attrs={'class':'price'}).text[13:-11]
    records.append((name, price,))
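Once records is filled, a common next step is writing it out; a minimal sketch using the standard csv module (the products.csv filename is just an example):

import csv

with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'price'])  # header row
    writer.writerows(records)           # one row per (name, price) tuple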
Related
I am building a quite simple beautifulsoup/requests web scraper, but when running it on a jobs website, the error
AttributeError: 'NoneType' object has no attribute 'find_all'
appears.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://uk.indeed.com/jobs?q&l=Norwich%2C%20Norfolk&vjk=139a4549fe3cc48b"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
job_elements = results.find_all("div", class_="resultContent")
python_jobs = results.find_all("h2", string="Python")
for job_element in job_elements:
    title_element = job_element.find("h2", class_="jobTitle")
    company_element = job_element.find("span", class_="companyName")
    location_element = job_element.find("div", class_="companyLocation")
    print(title_element)
    print(company_element)
    print(location_element)
    print()
Does anyone know what the issue is?
Check your selector for results: the attribute id should be resultsBody. The wrong selector causes the error in every line that uses results, because None has no attributes:
results = soup.find(id="resultsBody")
Also, job_elements is a td, not a div:
job_elements = results.find_all("td", class_="resultContent")
You could also chain the selectors using CSS selectors:
job_elements = soup.select('#resultsBody td.resultContent')
To get only those that contain Python:
job_elements = soup.select('#resultsBody td.resultContent:has(h2:-soup-contains("Python"))')
Example
import requests
from bs4 import BeautifulSoup
URL = "https://uk.indeed.com/jobs?q&l=Norwich%2C%20Norfolk&vjk=139a4549fe3cc48b"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="resultsBody")
job_elements = results.find_all("td", class_="resultContent")
python_jobs = results.find_all("h2", string="Python")
for job_element in job_elements:
    title_element = job_element.find("h2", class_="jobTitle")
    company_element = job_element.find("span", class_="companyName")
    location_element = job_element.find("div", class_="companyLocation")
    print(title_element)
    print(company_element)
    print(location_element)
    print()
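Note that :has() and :-soup-contains() are non-standard pseudo-classes provided by the soupsieve package behind select() in BeautifulSoup 4.7+ (soupsieve 2.1+ for :-soup-contains), so they need a reasonably recent install. As a usage sketch under the same assumed page structure, you can read each field with select_one, which returns None when nothing matches:

for job in soup.select('#resultsBody td.resultContent'):
    title = job.select_one('h2.jobTitle')
    company = job.select_one('span.companyName')
    location = job.select_one('div.companyLocation')
    # guard against missing fields before touching the text
    print(title.get_text(strip=True) if title else 'n/a',
          company.get_text(strip=True) if company else 'n/a',
          location.get_text(strip=True) if location else 'n/a')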
I have this code
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.cvbankas.lt/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for article_tag in soup.find_all("article"):
    a_tag = article_tag.find('a')
    urls.append(a_tag.attrs['href'])
    div_tag = article_tag.find('span')
    urls.append(div_tag.attrs['class'])
print(urls)
Can anyone explain to me how to get the data marked in red (the salary amount)?
You can get the span with the class "salary_amount":
salary_object = article_tag.find("span", class_= "salary_amount")
and then extract the text with the .text attribute of the created object.
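Put together with your loop, a minimal sketch (assuming every listing that shows a salary uses that class) could be:

for article_tag in soup.find_all("article"):
    salary_object = article_tag.find("span", class_="salary_amount")
    if salary_object is not None:  # not every listing necessarily shows a salary
        print(salary_object.text.strip())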
I want to scrape multiple pages from one site. The pattern is like this:
https://www.example.com/S1-3-1.html https://www.example.com/S1-3-2.html https://www.example.com/S1-3-3.html https://www.example.com/S1-3-4.html https://www.example.com/S1-3-5.html
I tried three methods to scrape all of these pages at once, but every method only scrapes the first page. I show the code below; anyone who can check it and tell me what the problem is will be highly appreciated.
===============method 1====================
import requests
for i in range(5): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
===============method 2=============
import urllib2,sys
from bs4 import BeautifulSoup
for numb in ('1', '5'):
    address = ('https://www.example.com/S1-3-' + numb + '.html')
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html,'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
=============method 3==============
import requests
from bs4 import BeautifulSoup
url = 'https://www.example.com/S1-3-1.html'
for round in range(5):
    res = requests.get(url)
    soup = BeautifulSoup(res.text,'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    paging = soup.select('div.paging a')
    next_url = 'https://www.example.com/'+paging[-1]['href'] # paging[-1]['href'] is the next-page button on the page
    url = next_url
I checked some answers, but it is not a loop problem: the results screenshots show only first-page results. This has annoyed me for several days.
Your indentation is out of order.
Try this (Method 1):
from bs4 import BeautifulSoup
import requests
for i in range(1, 6): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
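Note that results is overwritten on every pass, so after the loop you only hold the last page. If you want the products from all five pages together, extend one list instead (a sketch under the same URL pattern):

from bs4 import BeautifulSoup
import requests

all_results = []
for i in range(1, 6):
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # extend, not assign, so earlier pages are kept
    all_results.extend(soup.find_all('div', attrs={'class': 'product-item item-template-0 alternative'}))
print(len(all_results))  # total product blocks across all pages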
Your page analysis should be inside the loop, like this; otherwise it will only use one page:
.......
for i in range(5): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
........
First, you have to put all of the processing inside the loop; otherwise it will only work with the last iteration.
Second, you could try closing the requests response at the end of each iteration:
import requests
for i in range(5): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    r.close()
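Closing the response is harmless, but if the goal is clean connection handling across the five requests, a sketch with requests.Session as a context manager (same assumed URL pattern) is the more idiomatic route; the session pools connections and closes them on exit:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:  # pooled connections, closed automatically on exit
    for i in range(1, 6):
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = session.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class': 'product-item item-template-0 alternative'})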
I am new to Python and web scraping. I am trying to scrape a website (the link is the url in the code). I am getting the error "'NoneType' object is not iterable" on the last line of the code below. Could anyone point out what could have gone wrong?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = 'https://labtestsonline.org/tests-index'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
# Function to get hyper-links for all test components
hyperlinks = []
def parseUrl(url):
    global hyperlinks
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    for a in soup.findAll('div',{'class':'field-content'}):
        a = a.find('a')
        href = urljoin(url, a.get('href'))
        hyperlinks.append(href)
parseUrl(url)
# function to get header and common questions for each test component
def header(url):
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    h = []
    commonquestions = []
    for head in soup.find('div',{'class':'field-item'}).find('h1'):
        heading = head.get_text()
        h.append(heading)
    for q in soup.find('div',{'id':'Common_Questions'}):
        questions = q.get_text()
        commonquestions.append(questions)

for i in range(0, len(hyperlinks)):
    header(hyperlinks[i])
Below is the traceback error:
<ipython-input-50-d99e0af6db20> in <module>()
      1 for i in range(0, len(hyperlinks)):
----> 2     header(hyperlinks[i])

<ipython-input-49-15ac15f9071e> in header(url)
      5     soup = BeautifulSoup(page, 'lxml')
      6     h = []
----> 7     for head in soup.find('div',{'class':'field-item'}).find('h1'):
      8         heading = head.get_text()
      9         h.append(heading)

TypeError: 'NoneType' object is not iterable
soup.find('div',{'class':'field-item'}).find('h1') is returning None. First check whether the function returns anything before looping over it.
Something like:
heads = soup.find('div',{'class':'field-item'}).find('h1')
if heads:
    for head in heads:
        # remaining code
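Keep in mind the outer find() can return None as well, in which case even building heads raises an AttributeError. A sketch that guards both steps (using the h list from your header function):

container = soup.find('div', {'class': 'field-item'})
heads = container.find('h1') if container else None
if heads:
    for head in heads:
        heading = head.get_text()
        h.append(heading)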
Try this. It should solve the issues you are having at this moment. I used a CSS selector to get the job done.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
link = 'https://labtestsonline.org/tests-index'
page = requests.get(link)
soup = BeautifulSoup(page.content, 'lxml')
for a in soup.select('.field-content a'):
    new_link = urljoin(link, a.get('href'))  # joining broken urls so as to reuse them
    response = requests.get(new_link)  # sending another http request
    sauce = BeautifulSoup(response.text,'lxml')
    for item in sauce.select("#Common_Questions .field-item"):
        print(item.text)
    print("<<<<<<<<<>>>>>>>>>>>")
When I run this code it gives me an empty bracket. I'm new to web scraping, so I don't know what I'm doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
#btw the space is also there in the html code
print(container)
results:
[]
What I tried is to grab the HTML code from the site and to soup through the li tags where all the information is stored, so I can print out all the information in a for loop.
Also, if someone wants to explain how to use BeautifulSoup, we can always talk.
Thank you, guys.
You are most likely getting a blocked or stripped-down page because Amazon filters the default requests User-Agent; sending a browser-like User-Agent header helps. So working code that grabs product and price could look something like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
    h2 = cont.h2.text.strip()
    # Amazon lists prices in two ways. If one fails, use the other
    try:
        currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
        price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
    except:
        price = cont.find('span', {'class': 'a-size-base a-color-base'})
    print('Product: {}, Price: {}'.format(h2, price))
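One refinement: the fallback branch above assigns the Tag itself rather than its text, and a bare except hides unrelated errors. A variant of the try/except (same assumed class names) that catches only the expected AttributeError and yields plain text in both branches:

    try:
        currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
        price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
    except AttributeError:  # raised when either price element is missing
        fallback = cont.find('span', {'class': 'a-size-base a-color-base'})
        price = fallback.text.strip() if fallback else 'N/A'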
Let me know if that helps you further...