Get HTML attribute by the content with bs4 web scraping - Python

So I'm currently trying to get a certain attribute just by the content of the HTML element.
I know how to get an attribute by another attribute in the same HTML section. But this time I need the attribute by the content of the section.
"https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744" this is the link I try to scrape.
So I'm trying to get the "data-id" just by the " US 12"
What I tried is to get it similarly to how I'd get an attribute by another attribute.
This is my code:
def carting():
    a = session.get(producturl, headers=headers, proxies=proxy)
    soup = BeautifulSoup(a.text, "html.parser")
    product_id = soup.find("div", {"class": "product-grid"})["data-product-id"]
    option_id = soup.find("option", {"option": " US 12"})["data-id"]
    print(option_id)

carting()
This is what I get:
'NoneType' object is not subscriptable
I know that the code is wrong and doesn't work as I wrote it, but I cannot figure out how else I'm supposed to do it.
Would appreciate help, and of course, if you need more information, just ask.
Kind Regards

Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
sizes = soup.select_one("#product-size-chooser")
print(sizes.select_one('option:-soup-contains("US 12")')["data-id"])
Prints:
16
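If the size is sold out or missing, select_one can return None and indexing it raises the same 'NoneType' object is not subscriptable error as in the question. A minimal defensive variant of the same lookup (the helper name option_id_for_size is just for illustration):

import requests
from bs4 import BeautifulSoup

def option_id_for_size(soup, size_label):
    # return the data-id for the given size label, or None if it is not on the page
    sizes = soup.select_one("#product-size-chooser")
    if sizes is None:
        return None
    option = sizes.select_one(f'option:-soup-contains("{size_label}")')
    return option["data-id"] if option is not None else None

url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(option_id_for_size(soup, "US 12"))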

I suggest filtering the text with a regex, since there is whitespace around it:
soup.find("option", text=re.compile("US 12"))["data-id"]

There are a lot of ways to achieve this:
1st:
You can extract all the options and pick only the one you want with a loop:
# find all the option tags that have the attribute "data-id"
for option in soup.find_all("option", attrs={'data-id': True}):
    if option.text.strip() == 'US 12':
        print(option.text.strip(), '/', option['data-id'])
        break
2nd:
You can use a regular expression (regex):
import re
# get the option that has "US 12" in the string
option = soup.find('option', string=re.compile('US 12'))
3rd:
Using CSS selectors:
# get the option that has the attribute "data-id" and "US 12" in the string
option = soup.select_one('#product-size-chooser > option[data-id]:-soup-contains-own("US 12")')
I recommend you learn more about CSS selectors
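As a follow-up usage example, the same selector idea can collect every size and its data-id in one pass (a sketch, assuming the #product-size-chooser id used above):

# map each size label to its data-id, e.g. {"US 12": "16", ...}
sizes = {
    opt.get_text(strip=True): opt["data-id"]
    for opt in soup.select("#product-size-chooser option[data-id]")
}
print(sizes.get("US 12"))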

Related

Extracting data from Airtasker using beautiful Soup

I am trying to extract data from this website - https://www.airtasker.com/users/brad-n-11346775/.
So far, I have managed to extract everything except the license number. The problem I'm facing is bizarre as the license number is in the form of text. I was able to extract everything else like the Name, Address etc. For example, to extract the Name, I just did this:
name.append(pro.find('div', class_= 'name').text)
And it works just fine.
This is what I have tried to do, but I'm getting the output as None
license_number.append(pro.find('div', class_= 'sub-text'))
When I do :
license_number.append(pro.find('div', class_= 'sub-text').text)
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'text'
That means it does not recognise the license number as text, even though it is text.
Can someone please give me a workable solution and tell me what I am doing wrong?
Regards,
The badge with the license number is added to the HTML dynamically from a Bootstrap JSON that sits in one of the <script> tags.
You can find the tag with bs4 and scoop out the data with regex and parse it with json.
Here's how:
import ast
import json
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
scripts = BeautifulSoup(page, "lxml").find_all("script")[-4]
bootstrap_JSON = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", scripts.string).group(1))
)
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])
Output:
Licence No. 28661
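Picking the script by position with find_all("script")[-4] is fragile if the page layout changes. A hedged alternative, assuming the bootstrap data still sits inside a JSON.parse(...) call, is to search for the script that contains that call:

import ast
import json
import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
soup = BeautifulSoup(page, "lxml")

# find the <script> whose text contains the JSON.parse(...) bootstrap call
# instead of relying on its position in the document
script = soup.find("script", string=re.compile(r"parse\("))
if script is not None:
    bootstrap_JSON = json.loads(
        ast.literal_eval(re.search(r"parse\((.*)\)", script.string).group(1))
    )
    print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])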

Can't find the Web scraping selector for div class

I am new to web scraping and this is one of my first web scraping projects; I can't find the right selector for my soup.select("").
I want to get the "data-phone" (see picture below to understand), but it is in a div class and then inside an <a href>, which makes it a little complicated for me!
I searched online and found that I have to use soup.find_all, but this is not very helpful. Can anyone help me or give me a quick tip? Thank you!
my code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class', 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)
I think one of the simplest ways is to use soup.select, which allows normal CSS selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
Ok I have searched in the DOM myself, and here is how I managed to retrieve all the phone data:
anchores = soup.select('a[data-phone]')
for a in anchores:
    print(a.get('data-phone'))
It also works with only the data attribute selector, like this: soup.select('[data-phone]')
Surprisingly, this one with classes also works for me:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu

Python 3 Regular expression on html links

My title may not be the most precise, but I had some trouble coming up with a better one, and considering it's work hours, I'll go with this.
What I am trying to do is get the links from this specific page, then by using RE find specific links that are job ads with certain keywords in it.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case it's "säljare", Swedish for sales).
I would appreciate it if anyone could look at my RE and say or hint towards fixing it. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))
If I understand your question correctly, you do not need a regex at all. Just check if the title attribute containing the job title is present in the link, and then check it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
import re
import ssl
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords;
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.
Instead of using re, you can try the in keyword:
for link in dataSoup.find_all('a'):
    if keyword in link.text:
        print(link)
A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a // literal
[^>]+ // 1 or more not '>'
href=\"([^\"]+)\" // href literal then 1 or more not '"' grouped
[^>]+ // 1 or more not '>'
title=\" // literal
( // start of group
(?=[^\"]*säljare[^\"]*) // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+ // 1 or more not '"'
)\" // end of group
Flags: global, case insensitive
Assumes: title after href
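A minimal sketch of how the pattern could be applied in Python (re.IGNORECASE stands in for the case-insensitive flag and findall for the global flag; the HTML is fetched the same way as in the question):

import re
import urllib.request

url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"

# the pattern from above: capture href, then a title that contains "säljare"
pattern = re.compile(
    r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare[^"]*)[^"]+)"',
    re.IGNORECASE,
)

html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
for href, title in pattern.findall(html):
    print(title, href)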

Python: Print Specific line of text out of TD tag

This is an easy one, I am sure. I am parsing a website and I am trying to get the specific text in between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all the inner text results, but I have not been able to find a good solution on the web for getting specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'active', 'default'])
This will select only those <a …> tags that have one of the given strings as their text.
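If the casing on the page doesn't exactly match the list (the question mentions both revoked and Active), a case-insensitive regex is a small variation on the same idea:

import re

# match <a> tags whose text contains revoked, active or default, in any casing
for tag in soup.find_all('a', text=re.compile(r'revoked|active|default', re.I)):
    print(tag.text)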
I've used the snippet below in a similar occasion. See if this works with your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)

Python Regular Expression string exclusion

Using BeautifulSoup to parse source code for scraping:
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
currentTempSite = BeautifulSoup(theTempSite)
lightwaveEmail = currentTempSite('input')[7]
#<input type="Hidden" name="bb_recipient" value="comm2342#gmail.com" />
How can I re.compile lightwaveEmail so that only comm2342#gmail.com is printed?
You're kinda going about it the wrong way. The reason it's the wrong way is that you're using numbered indexes to find the tag you want; BeautifulSoup can find tags for you based on their name or attributes, which makes it a lot simpler.
You want something like:
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
soup = BeautifulSoup(theTempSite)
tag = soup.find("input", { "name" : "bb_recipient" })
print tag['value']
If the question is how to get the value attribute from the tag object, then you can use it as a dictionary:
lightwaveEmail['value']
You can find more information about this in the BeautifulSoup documentation.
If the question is how to find in the soup all input tags with such a value, then you can look for them as follows:
soup.findAll('input', value=re.compile(r'comm2342#gmail.com'))
You can find a similar example also in the BeautifulSoup documentation.
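A short usage sketch for that findAll variant, looping over all matches and printing their value attributes (it assumes the value attribute still contains the address shown in the comment):

import re

# print the value attribute of every hidden input whose value matches the address
for tag in soup.findAll('input', value=re.compile(r'comm2342#gmail\.com')):
    print(tag['value'])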
