Extract List Values using Beautiful Soup in Python

Please help me extract information from the list output below. I want to extract the count, i.e. "4", and the text "bds" from the abbr tag in the following output:
[<ul class="list-card-details">
<li>4<abbr class="list-card-label"> <!-- -->bds</abbr></li>
<li>4<abbr class="list-card-label"> <!-- -->ba</abbr></li>
<li>2,482<abbr class="list-card-label"> <!-- -->sqft</abbr></li>
</ul>]
I got the above output by running the below code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
list_info = soup.find_all('div', class_='list-card-info')
house_one = list_info[0]
house_one_price = house_one.find('div', class_='list-card-price').text
house_one_bds_count = house_one.find_all('ul', class_='list-card-details')
print(house_one_bds_count)
#house_one_bds = house_one.li('abbr', class_='list-card-label').text  -- working fine, so I commented it to incorporate it later into the code
#print(house_one_price) -- working fine, so I commented it to incorporate it later into the code
Also, can you please explain why I get the error "AttributeError: ResultSet object has no attribute 'content'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" when I add .text, as in print(house_one_bds_count.text)?
I will be very thankful. I am new to Stack Overflow, so apologies if the formatting is not correct. Thanks in advance.
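For reference, a ResultSet is a plain list of matches with no .text of its own, which is what the error message is about. Below is a minimal, self-contained sketch (it reuses the markup from the question so it runs without the original page) of one way to pull out each count and label:

```python
from bs4 import BeautifulSoup

# Reproduction of the <ul> from the question, so the sketch is runnable.
html_doc = """
<ul class="list-card-details">
<li>4<abbr class="list-card-label"> <!-- -->bds</abbr></li>
<li>4<abbr class="list-card-label"> <!-- -->ba</abbr></li>
<li>2,482<abbr class="list-card-label"> <!-- -->sqft</abbr></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns a single Tag (so .text works on it); find_all()
# returns a ResultSet, which has no .text attribute.
details = soup.find("ul", class_="list-card-details")
for li in details.find_all("li"):
    count = li.contents[0].strip()        # text node before <abbr>, e.g. "4"
    label = li.abbr.get_text(strip=True)  # e.g. "bds"
    print(count, label)
```

In the original code, calling house_one.find('ul', class_='list-card-details') (find, not find_all) would likewise give a single Tag to work with.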

Related

Get HTML class attribute by the content with bs4 web scraping

So I'm currently trying to get a certain attribute just by the content of the HTML element.
I know how to get an attribute by another attribute in the same HTML section. But this time I need the attribute by the content of the section.
"https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744" this is the link I try to scrape.
So I'm trying to get the "data-id" just by the " US 12"
What I tried to do is getting it similar to how I'd get an attribute by an attribute.
This is my code:
def carting():
    a = session.get(producturl, headers=headers, proxies=proxy)
    soup = BeautifulSoup(a.text, "html.parser")
    product_id = soup.find("div", {"class": "product-grid"})["data-product-id"]
    option_id = soup.find("option", {"option": " US 12"})["data-id"]
    print(option_id)

carting()
This is what I get:
'NoneType' object is not subscriptable
I know the code is wrong and doesn't work as I wrote it, but I cannot figure out how else I'm supposed to do it.
Would appreciate help, and of course if you need more information, just ask.
Kind Regards
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
sizes = soup.select_one("#product-size-chooser")
print(sizes.select_one('option:-soup-contains("US 12")')["data-id"])
Prints:
16
I suggest filtering the text with a regex, since there is whitespace around it:
import re
soup.find("option", text=re.compile("US 12"))["data-id"]
There are a lot of ways to achieve this:
1st:
you can extract all the options and only pick the one you want with a loop
# find all the option tags that have the attribute "data-id"
for option in soup.find_all("option", attrs={'data-id': True}):
    if option.text.strip() == 'US 12':
        print(option.text.strip(), '/', option['data-id'])
        break
2nd:
you can use a regular expression (regex)
import re
# get the option that has "US 12" in the string
option = soup.find('option', string=re.compile('US 12'))
3rd:
using the CSS selectors
# get the option that has the attribute "data-id" and "US 12" in the string
option = soup.select_one('#product-size-chooser > option[data-id]:-soup-contains-own("US 12")')
I recommend you learn more about CSS selectors.
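To see the selector in action without hitting the shop, here is a sketch against hypothetical markup mirroring the size chooser (it assumes a reasonably recent soupsieve, which provides the :-soup-contains-own() pseudo-class):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real size chooser.
html = """
<select id="product-size-chooser">
  <option data-id="15"> US 11</option>
  <option data-id="16"> US 12</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains-own() matches on the element's own text, so the
# whitespace around " US 12" does not matter.
option = soup.select_one(
    '#product-size-chooser > option[data-id]:-soup-contains-own("US 12")'
)
print(option["data-id"])
```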

Extracting data from Airtasker using beautiful Soup

I am trying to extract data from this website - https://www.airtasker.com/users/brad-n-11346775/.
So far, I have managed to extract everything except the license number. The problem I'm facing is bizarre, as the license number is plain text. I was able to extract everything else, like the Name, Address, etc. For example, to extract the Name, I just did this:
name.append(pro.find('div', class_= 'name').text)
And it works just fine.
This is what I have tried to do, but I'm getting the output as None
license_number.append(pro.find('div', class_= 'sub-text'))
When I do :
license_number.append(pro.find('div', class_= 'sub-text').text)
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'text'
That means it does not recognise the license number as text, even though it is text.
Can someone please give me a workable solution and tell me what I am doing wrong?
Regards,
The badge with the license number is added to the HTML dynamically from a bootstrap JSON that sits in one of the <script> tags.
You can find the tag with bs4 and scoop out the data with regex and parse it with json.
Here's how:
import ast
import json
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
scripts = BeautifulSoup(page, "lxml").find_all("script")[-4]
bootstrap_JSON = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", scripts.string).group(1))
)
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])
Output:
Licence No. 28661
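The same regex + ast.literal_eval + json.loads pipeline can be checked offline with a toy stand-in for the page's <script> tag (the real bootstrap JSON is much larger, and the key name here is made up):

```python
import ast
import json
import re
from bs4 import BeautifulSoup

# Toy <script> holding a JS string literal passed to JSON.parse(),
# just like the Airtasker bootstrap tag.
html = '<script>var data = JSON.parse("{\\"licence\\": \\"No. 28661\\"}")</script>'
script = BeautifulSoup(html, "html.parser").find("script")

# 1. the regex grabs the string literal inside JSON.parse(...)
# 2. ast.literal_eval turns that literal into plain JSON text
# 3. json.loads parses the JSON into a dict
raw = re.search(r"parse\((.*)\)", script.string).group(1)
data = json.loads(ast.literal_eval(raw))
print(data["licence"])
```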

Can't traverse into element to scrape rotten tomatoes ratings data using Beautiful Soup and Selenium

I am trying to get to the element that contains the rating data, but I can't figure out how to traverse into it (image linked below). The span element for both the critic rating and audience rating is in the same class (mop-ratings-wrap__percentage). I tried to get the elements by separately traversing into their respective divs ('mop-ratings-wrap__half' and 'mop-ratings-wrap__half audience-score') but I am getting this error:
runfile('/Users/*/.spyder-py3/temp.py', wdir='/Users/*/.spyder-py3')
Traceback (most recent call last):
  File "/Users/*/.spyder-py3/temp.py", line 22, in <module>
    cr=a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
TypeError: find() takes no keyword arguments
Here is my code:
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome("/Users/*/Downloads/chromedriver")
critics_rating=[]
audience_rating=[]
driver.get("https://www.rottentomatoes.com/m/bill_and_ted_face_the_music")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.find('div', attrs={'class':'mop-ratings-wrap__half'}):
    cr = a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    critics_rating.append(cr.text)
for b in soup.find('div', attrs={'class':'mop-ratings-wrap__half audience-score'}):
    ar = b.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    audience_rating.append(ar.text)
print(critics_rating)
I am following this article: https://www.edureka.co/blog/web-scraping-with-python/#demo
And here is the data I want to extract
I suspect that iterating over the result of
soup.find()
hands you plain strings rather than the bs4 objects you are expecting. Therefore you end up calling
"somestring".find()
which takes no keyword arguments.
(I would comment this, but I lack reputation, sorry.)
The issue is in your loop: for a in soup.find('div', attrs={'class':'mop-ratings-wrap__half'}): returns one single element, and iterating over it traverses its children, including plain strings. You cannot run the find method with keyword arguments on a string.
Solution: if you want to loop through elements and use find on each of them, use find_all instead. It returns a list of elements, which you can traverse one by one in a loop.
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
ratings = []
for a in soup.find_all('div', attrs={'class':'mop-ratings-wrap__half'}):
    cr = a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    ratings.append(cr.text)
for rating in ratings:
    print(rating.replace("\n", "").strip())
Output: the above code will print the two percentages.
Note: this is not the most sophisticated way to get your desired result, but I have tried to answer your doubt rather than give a better solution. You can use ratings[0] to print the critic rating and ratings[1] to print the audience rating.
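The find/find_all difference is easy to reproduce offline; the markup below is made up, but the failure mode is the same as on the Rotten Tomatoes page:

```python
from bs4 import BeautifulSoup

# Made-up markup with two score blocks.
html = """
<div class="score"><span class="pct">81%</span></div>
<div class="score"><span class="pct">78%</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Iterating over find() would walk the children of ONE tag, some of
# which are plain strings, and str.find() takes no keyword arguments.
# find_all() yields one Tag per matching element instead.
ratings = []
for div in soup.find_all("div", attrs={"class": "score"}):
    span = div.find("span", attrs={"class": "pct"})
    ratings.append(span.text)
print(ratings)  # ['81%', '78%']
```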

Scraping PFR Football Data with Python for a Beginner

Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and in trying to understand how to solve the issue, I can't figure it out.
Specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I've got so far; it's from this tutorial, which uses a page from the same site, but one that only has a single table.
sample code:
#url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"
#html from the given url
html=urlopen(url)
# make soup object of html
soup = BeautifulSoup(html)
# we see that soup is a beautifulsoup object
type(soup)
#
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense").findAll('th')]
column_headers #our column headers
Attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table. But I repeatedly get an error saying:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
when changing it to find, the error becomes:
AttributeError: 'NoneType' object has no attribute 'find'
I'll be absolutely honest that I have no idea what I'm doing or what these mean. I'd appreciate any help in figuring out how to target that data and then scrape it.
Thank you,
You're missing a "}" in the dict after the word "defense". Try the code below and see if it works.
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense"}).findAll('th')]
First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
The other problem is that the table with id "defense" is commented out in the html on that page:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
<table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense & Fumbles Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
etc. I assume that JavaScript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.
Like this:
from bs4 import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)
defenseSoup = BeautifulSoup(str(defenseComment))
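The comment-parsing trick can be verified with a toy page (hypothetical markup; the real defense table is far bigger):

```python
from bs4 import BeautifulSoup, Comment

# A table hidden inside an HTML comment, like on the PFR page.
html = """
<div class="placeholder"></div>
<!--
<table id="defense"><thead><tr><th>Player</th><th>Sk</th></tr></thead></table>
-->
"""
soup = BeautifulSoup(html, "html.parser")

# Comments are stored verbatim, so find the one containing the table
# and re-parse its text as HTML.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defense_comment = next(c for c in comments if 'id="defense"' in c)
defense_soup = BeautifulSoup(str(defense_comment), "html.parser")
headers = [th.get_text() for th in
           defense_soup.find("table", {"id": "defense"}).find_all("th")]
print(headers)  # ['Player', 'Sk']
```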

Python: Print Specific line of text out of TD tag

This is an easy one, I am sure. I am parsing a website and trying to get the specific text in between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all the inner-text results, but I have not been able to find a good solution on the web for getting specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'active', 'default'])
This will select only those <a …> tags that have one of the given strings as their text.
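A self-contained sketch using the modern bs4 import (the question imports the old BeautifulSoup 3, where the method is spelled findAll). Note the string match is exact and case-sensitive, so list the values with the casing the page actually uses:

```python
from bs4 import BeautifulSoup

# Made-up markup with status links among other anchors.
html = """
<a href="/1">Revoked</a>
<a href="/2">Active</a>
<a href="/3">more info</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A list passed as the string filter matches any of its exact values.
statuses = soup.find_all("a", string=["Revoked", "Active", "Default"])
print([a.text for a in statuses])  # ['Revoked', 'Active']
```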
I've used the snippet below in a similar occasion. See if this works with your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
print(i)
