Can you please help me with my Python code? I want to parse several homepages with BeautifulSoup; the URLs are in the list html, and each one should be processed by the function stars.
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    bsObj = BeautifulSoup(html.read())
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)
    lst = []
    lst.append(cleantext)

stars(html)
Instead, I am getting the error "AttributeError: 'list' object has no attribute 'read'".
As some of the comments mentioned, you need to use the requests library to actually fetch the content of each link in your list.
import requests
from bs4 import BeautifulSoup
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    for url in html:
        resp = requests.get(url)
        bsObj = BeautifulSoup(resp.content, 'html.parser')
        print(bsObj)  # Should print the entire HTML document.
        # Do other stuff with bsObj here.

stars(html)
The IndexError from bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16] is something you'll need to figure out yourself.
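If you want a starting point, here is a minimal sketch of guarding that chain so a missing element prints a message instead of raising; it would go inside the loop above, in place of the "other stuff" comment. The indices are the ones from the question; whether they point at the right elements on these pages is a separate matter.

sections = bsObj.find_all("section")
if len(sections) > 8:
    divs = sections[8].find_all("div")
    if len(divs) > 1:
        spans = divs[1].find_all("span")
        if len(spans) > 16:
            print(spans[16].get_text())  # the star-rating text, if the indices are right
        else:
            print("Fewer spans than expected on", url)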
You have a couple of errors here:
1. You are trying to load the whole list of pages into BeautifulSoup. You should process the pages one at a time.
2. You should fetch the source code of a page before processing it.
3. There is no "section" element on the page you are loading, so taking the element at index 8 will raise an exception. You should check whether you actually found anything before indexing into the results.
import requests
from bs4 import BeautifulSoup

def stars(html):
    request = requests.get(html)
    if request.status_code != 200:
        return
    page_content = request.content
    bsObj = BeautifulSoup(page_content, "html.parser")
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)

# html is the list of URLs from the question
for page in html:
    stars(page)
I am trying to practice web scraping on Google's Jobs webpage.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=google+jobs&rlz=1C5CHFA_enUS993US993&oq=google+jobs&aqs=chrome..69i57j0i512j0i433i512l2j0i512j69i60l3.1100j0j9&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&ved=2ahUKEwj-9Mak9tX4AhVqIEQIHX7jCQgQutcGKAF6BAgDEAY&sxsrf=ALiCzsb2SZksAxw0NnFdLD3WRkzUb7z0PA:1656617817988#htivrt=jobs&htidocid=9DZhMax31VQAAAAAAAAAAA%3D%3D&fpstate=tldetail'
page = requests.get(url)
Everything is working fine up to this point. If I print the page variable, I get response 200.
Next,
pageContent = BeautifulSoup(page.content, 'html.parser')
jobList = pageContent.find_all('div', attrs = {'class': 'BjJfJf PUpOsf'})
Now printing jobList just prints an empty list.
I've tried other formats:
jobList = pageContent.find_all('section', class_='BjJfJf PUpOsf')
jobList = pageContent.find_all('div', class_='BjJfJf PUpOsf')
They all print an empty list as well.
So, I did further investigation and checked to see if the html produced in this line:
pageContent = BeautifulSoup(page.content, 'html.parser')
...produced the same HTML with all the divs and whatnot. I even checked the classes of the divs in pageContent:
contentDiv = pageContent.find_all("div")
for div in contentDiv:
    print(div.get("class"))
Obviously, the class I am looking for, BjJfJf PUpOsf, wasn't listed. These were, however:
None
None
['w43NHb']
['qDdeqd']
['diIKdd']
None
['SC3FEd']
['oLm9bf']
So, the html that was gathered through BeautifulSoup appears to be different from the html I see when I do inspect element on the page or when I view page source.
My question is, how can I actually access that class or any class that is formatted similar to it so I can print it to console or otherwise use it for gathering data for future analysis?
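In short: requests only receives Google's initial HTML, and the job cards (including the BjJfJf PUpOsf class) are added afterwards by JavaScript, which is why the parsed soup and the browser's inspect-element view disagree. A quick way to confirm this from the script, reusing the page variable from the question:

print('BjJfJf' in page.text)  # False: the class never appears in the raw HTML

To reach the rendered markup you need something that actually runs the JavaScript, such as Selenium, or an official API where one is available.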
I want to get text from a website using bs4, but I keep getting an error and I don't know why. This is the error: TypeError: slice indices must be integers or None or have an __index__ method.
This is my code:
from urllib.request import urlopen
import bs4
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
text = html.find("div", {"class":"gc-score__title"})#the error is in this line
print(text)
On this line:
text = html.find("div", {"class":"gc-score__title"})
you are calling the str.find method, not the bs4.BeautifulSoup.find method.
So if you do
soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)
you will get rid of the error.
That said, the site is using JavaScript, so this will not yield what you expect. You will need to use tools like Selenium to scrape this site.
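A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed (the class name is the one from your question):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # crude wait for the JavaScript to finish; WebDriverWait is the more robust option
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

div = soup.find("div", {"class": "gc-score__title"})
print(div.get_text() if div else "div not found")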
First, if you want BeautifulSoup to parse the data, you need to ask it to do that.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, "html.parser")
Then you can use soup.find to find <div> tags:
text = soup.find("div", {"class":"gc-score__title"})
That will eliminate the error. You were calling str.find because html is a string, and to pick tags out you need to call the find method of a bs4.BeautifulSoup object.
But besides eliminating the error, that line won't do what you want. It won't return anything, because the data at that url does not contain the tag <div class="gc-score__title">.
Copy the contents of html_bytes to a text editor to confirm this.
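If a text editor is inconvenient, the same check can be done from the script itself (the file name here is arbitrary):

with open("page_dump.html", "wb") as f:  # html_bytes is bytes, so write in binary mode
    f.write(html_bytes)

Searching page_dump.html for gc-score__title turns up nothing.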
I have some code which I wrote with the help of this community (shoutout to @chitown88).
Now I want to implement the same method for scraping the photos on the pages. One example is the following URL:
https://www.meisamatr.com/fa/product/%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C/%D9%84%D9%88%D8%A7%D8%B2%D9%85-%D8%AC%D8%A7%D9%86%D8%A8%DB%8C-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C/%D8%A8%D8%B1%D8%B3%D8%8C-%D8%A7%D9%BE%D9%84%DB%8C%DA%A9%D8%A7%D8%AA%D9%88%D8%B1-%D9%88-%D8%B3%D8%A7%DB%8C%D8%B1/4647-%D8%A7%D8%B3%D9%86%D8%B3-%D8%A8%D8%B1%D8%B3-%D9%84%D8%A8-%D9%87%D9%84%D9%88-%D9%87%D9%BE%DB%8C%D9%86%D8%B3.html
I want to download the full-size picture which can be found if we inspect element on the picture:
<img src="https://www.meisamatr.com/upload/thumb1/product/1518428319.jpg" alt="اسنس برس لب هلو هپینس" title="اسنس برس لب هلو هپینس" class="thumb" data-large-img-url="https://www.meisamatr.com/upload/product/1518428319.jpg" id="magnifier-item-0">
The following URL is what we need:
data-large-img-url="https://www.meisamatr.com/upload/product/1518428319.jpg"
Let's suppose we have a file called links.txt which looks like this:
https://www.meisamatr.com/fa/product/آرایشی/آرایش-صورت/کانسیلر/6494-اسنس-کانسیلر-کموفلاژ-با-پوشانندگی-کامل-10.html
https://www.meisamatr.com/fa/product/آرایشی/آرایش-صورت/کانسیلر/6493-اسنس-کانسیلر-کموفلاژ-با-پوشانندگی-کامل-05.html
https://www.meisamatr.com/fa/product/آرایشی/آرایش-صورت/کرم-پودر/6492-اسنس-هایلایتر-برنزه-کننده-مایع.html
https://www.meisamatr.com/fa/product/آرایشی/آرایش-صورت/پودر-صورت/6491-اسنس-پودر-فشرده-صورت-10.html
.
.
.
The following is what I came up with, but it raises a "No connection adapters were found for" error.
What do you suggest? Thank you in advance for your time.
import requests
import urllib.request
from bs4 import BeautifulSoup

with open('links.txt','r') as f:
    urls = f.read().split()

for url in urls:
    try:
        source = requests.get(url).text
        soup = BeautifulSoup(source, 'lxml')
        page = soup.find_all('div', class_='slick-slide slick-active')
        pic = page.find('img', class_='thumb')['data-large-img-url']
        print(pic)
        urllib.request.urlretrieve(pic, "local-filename.jpg")
    except Exception as e:
        print(e)
        break
You're not too far off; you just need to adjust how you get those specific tags.
soup.find_all returns a list of elements, and you would need to iterate through it to get what you want. However, in the links I've seen, what you are trying to get is in the first instance of that tag, so I simply changed .find_all() to .find().
I also removed the break. With the break, your loop stops as soon as it hits an error on one link, even if it hasn't gotten through all the links. Removing the break lets it continue to the other links even when one of them doesn't yield the image URL.
You may also want to make the saved image file name dynamic; otherwise it will be overwritten on each iteration:
import requests
import urllib.request
from bs4 import BeautifulSoup

with open('links.txt','r', encoding="utf8") as f:
    urls = f.read().split()

num = 1
for url in urls:
    try:
        source = requests.get(url).text
        soup = BeautifulSoup(source, 'lxml')
        page = soup.find('div', class_='single_slideshow_big')
        pic = page.find('img')['data-large-img-url']
        print(pic)
        urllib.request.urlretrieve(pic, "local-filename_%02d.jpg" % (num))
        num += 1
    except Exception as e:
        print(e)
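As an alternative to the counter, the file name can be derived from the image URL itself using the standard library; this is a small sketch that reuses the pic variable from the loop:

import os
from urllib.parse import urlparse

filename = os.path.basename(urlparse(pic).path)  # e.g. 1518428319.jpg
urllib.request.urlretrieve(pic, filename)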
I'm writing a program to parse webpages to snatch the title and headers so I can give SEO consultation without manually clicking through all the code.
The code works, but only returns a single instance of each tag I'm looking for. If there are, say, 5 h1's in the HTML, I only get the first one. How do I get the rest? I'm thinking a loop but I'm not sure how to go about it.
Here's the code:
# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
#specify URL
quote_page = input('What URL would you like to scrape?')
#query website and return HTML to the variable page
page = urlopen(quote_page)
#parse the HTML with BeautifulSoup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
#now we have the HTML as soup, so we need to grab the title and headers
title = soup.find('title')
h1s = soup.find('h1')
h2s = soup.find('h2')
h3s = soup.find('h3')
metadescription = soup.find('meta name="description"')
#print out the data in readable format, including "none" for missing data
#types
print()
print('Title:')
print(title)
print()
print('H1s:')
print(h1s)
print()
print('H2s:')
print(h2s)
print()
print('H3s:')
print(h3s)
print()
print('Description:')
print(metadescription)
Use soup.find_all('h1') to get them all.
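For example (a short sketch; the same pattern works for h2 and h3):

h1s = soup.find_all('h1')  # a list of every <h1> tag, not just the first
for h1 in h1s:
    print(h1.get_text())

As an aside, soup.find('meta name="description"') searches for a tag literally named meta name="description", so it will always return None; the tag name and the attribute are passed separately:

metadescription = soup.find('meta', attrs={'name': 'description'})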
I am trying to automate the process of obtaining the number of followers of different Twitter accounts using the page source.
I have the following code for one account:
from bs4 import BeautifulSoup
import requests

username = 'justinbieber'
url = 'https://www.twitter.com/' + username
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

for tag in soup.findAll('a'):
    if tag.has_attr('class'):
        if tag['class'] == 'ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor':
            if tag['href'] == '/justinbieber/followers':
                print(tag.title)
                break
I am not sure where I went wrong. I understand that we can use the Twitter API to obtain the number of followers. However, I wish to try to obtain it through this method as well. Any suggestions?
I've modified the code from here
If I were you, I'd pass the class name as an argument to the find() function instead of find_all(), and I'd first look for the <li> element that contains the anchor you're looking for. It'd look something like this:
from bs4 import BeautifulSoup
import requests

username = 'justinbieber'
url = 'https://www.twitter.com/' + username
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

f = soup.find('li', class_="ProfileNav-item--followers")
title = f.find('a')['title']
print(title)
# 81,346,708 Followers

num_followers = int(title.split(' ')[0].replace(',', ''))
print(num_followers)
# 81346708
PS findAll() was renamed to find_all() in bs4