I am trying to extract data from this website - https://www.airtasker.com/users/brad-n-11346775/.
So far, I have managed to extract everything except the license number, which is odd because the license number appears on the page as plain text just like the rest. I was able to extract everything else, like the name and address. For example, to extract the name, I just did this:
name.append(pro.find('div', class_= 'name').text)
And it works just fine.
This is what I have tried, but the output is None:
license_number.append(pro.find('div', class_= 'sub-text'))
When I do:
license_number.append(pro.find('div', class_= 'sub-text').text)
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'text'
That means it does not find the element at all, even though the license number is plainly visible as text on the page.
Can someone please give me a workable solution and tell me what I am doing wrong?
Regards,
The badge with the license number is added to the HTML dynamically from a bootstrap JSON blob that sits in one of the <script> tags.
You can find the tag with bs4, scoop the data out with a regex, and parse it with the json module.
Here's how:
import ast
import json
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
# The bootstrap data lives in the fourth-from-last script tag
script = BeautifulSoup(page, "lxml").find_all("script")[-4]
# Pull the JS string literal out of the parse(...) call,
# unquote it with literal_eval, then parse the JSON
bootstrap_json = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", script.string).group(1))
)
print(bootstrap_json["profile"]["badges"]["electrical_vic"]["reference_code"])
Output:
Licence No. 28661
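One caveat on the snippet above: grabbing the script by position (`find_all("script")[-4]`) is brittle and will break if the page layout shifts. A slightly more defensive sketch scans every script for the parse(...) call instead. The miniature page below is invented for illustration, so only the search pattern carries over to the real site:

```python
import ast
import json
import re

from bs4 import BeautifulSoup

# Invented stand-in for the real page: one script wraps JSON in JSON.parse("...")
sample_page = """
<html><body>
<script>var x = 1;</script>
<script>window.bootstrap = JSON.parse("{\\"profile\\": {\\"name\\": \\"Brad\\"}}");</script>
</body></html>
"""

soup = BeautifulSoup(sample_page, "html.parser")

# Scan every script tag for the first one containing a JSON.parse call
for script in soup.find_all("script"):
    match = re.search(r"JSON\.parse\((.*)\)", script.string or "")
    if match:
        # literal_eval unquotes the JS string literal, json.loads parses it
        data = json.loads(ast.literal_eval(match.group(1)))
        break

print(data["profile"]["name"])  # -> Brad
```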
So I'm currently trying to get an attribute based solely on the text content of an HTML element.
I know how to look up an attribute by another attribute on the same element, but this time I need to find it by the element's content.
"https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744" this is the link I try to scrape.
So I'm trying to get the "data-id" using only the text " US 12".
What I tried to do is getting it similar to how I'd get an attribute by an attribute.
This is my code:
def carting():
    a = session.get(producturl, headers=headers, proxies=proxy)
    soup = BeautifulSoup(a.text, "html.parser")
    product_id = soup.find("div", {"class": "product-grid"})["data-product-id"]
    option_id = soup.find("option", {"option": " US 12"})["data-id"]
    print(option_id)

carting()
This is what I get:
'NoneType' object is not subscriptable
I know the code as written is wrong, but I cannot figure out how else I'm supposed to do it.
I would appreciate any help, and of course, if you need more information, just ask.
Kind Regards
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
sizes = soup.select_one("#product-size-chooser")
print(sizes.select_one('option:-soup-contains("US 12")')["data-id"])
Prints:
16
I suggest filtering the text using a regex, as you have whitespace around it:

import re

soup.find("option", string=re.compile("US 12"))["data-id"]
There are a lot of ways to achieve this.
1st:
You can extract all the options and pick the one you want with a loop:
# find all the option tags that have the attribute "data-id"
for option in soup.find_all("option", attrs={'data-id': True}):
    if option.text.strip() == 'US 12':
        print(option.text.strip(), '/', option['data-id'])
        break
2nd:
You can use a regular expression (regex):
import re
# get the option that has "US 12" in the string
option = soup.find('option', string=re.compile('US 12'))
3rd:
You can use CSS selectors:
# get the option that has the attribute "data-id" and "US 12" in the string
option = soup.select_one('#product-size-chooser > option[data-id]:-soup-contains-own("US 12")')
I recommend learning more about CSS selectors.
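As a quick sanity check, the first two approaches can be run against a stand-in fragment (the markup below is invented; the real page's option values will differ):

```python
import re

from bs4 import BeautifulSoup

# Invented markup mimicking the size chooser
html = """
<select id="product-size-chooser">
  <option data-id="15"> US 11</option>
  <option data-id="16"> US 12</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# 1st: loop over all options that carry a data-id
for option in soup.find_all("option", attrs={"data-id": True}):
    if option.text.strip() == "US 12":
        print(option["data-id"])  # -> 16
        break

# 2nd: regex match on the option text
print(soup.find("option", string=re.compile("US 12"))["data-id"])  # -> 16
```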
Ok, so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a database, and then pull from that database with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see, it's quite convoluted and hard to navigate with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, access its parent, and then iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint
url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
# statObject = {
# 'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
# }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
# print(tag.name)
print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts that returned nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium; however, I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even some advice on how else to approach this would be greatly appreciated.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get with BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
i.get("standings")["tabs"]
for i in data["props"]["pageProps"]["blocks"]
if i.get("standings") is not None
]
result = [
{ i["title"] : i["tables"][0]["teams"] }
for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: the keys are the titles of the 3 tabs and the values are the table data.
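If the response shape holds, each entry of `result` maps one tab title to its list of team rows, so it can be walked like this (the data below is a placeholder standing in for the live payload):

```python
# Placeholder mirroring the shape of the `result` list built above
result = [
    {"League Standings": [{"name": "Team A"}, {"name": "Team B"}]},
    {"East": [{"name": "Team C"}]},
]

# Each dict has a single tab title mapped to its team rows
for table in result:
    for title, teams in table.items():
        print(title, "->", [team["name"] for team in teams])
```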
This is my first Python web scraping attempt.
I have an IP camera that saves all of its files to an HTML document served over HTTP. Essentially, the camera is its own server that can be accessed over HTTP. The HTML it serves is very basic: it contains a single body tag, which holds all of its clips. The files look like:
MP_2018-04-23_11-14-04_60.mov
I am wanting to list/print these files without all of the other HTML associated to it.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://192.168.1.99/form/getStorageFileList').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
body = soup.body
for paragraph in body.find_all('b'):
    print(body.text)
I've included a few screenshots below as the error I am receiving is very lengthy. I am basically getting:
AttributeError: module 'html5lib.treebuilders' has no attribute '_base'
Would someone clarify and possibly point me in the right direction?
usr/lib/python3/dist-packages/bs4/builder/_html5lib.py in <module>()
68
69
---> 70 class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
71
72 def __init__(self, soup, namespaceHTMLElements):
AttributeError: module 'html5lib.treebuilders' has no attribute '_base'
There were a few errors in your script. No big deal though. Also, you might get more benefit from using the Requests library. What about something like this?
from bs4 import BeautifulSoup as bs
import requests

sauce = requests.get('http://192.168.1.99/form/getStorageFileList')
page = sauce.text  # converted the page to text
soup = bs(page, 'html.parser')  # changed to 'html.parser'
body = soup.find('body')  # grab the body tag
for paragraph in body.find_all('b'):
    print(paragraph.text)  # print each iterated item's text
Let me know if this is something you are looking for.
I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on Stack Overflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers with resulting in no change.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
Please follow the below steps:
Request the API link, not the HTML page link
Check the response, it's a JSON. So extract the fields that are of your need
Help your family member any time :)
Solution
import json

import requests

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
all_mp3
all_mp3 is what you need: a dictionary with download URLs as keys and mp3 names as values.
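Since the end goal is downloading the episodes, one possible follow-up sketch is below. Note that `safe_filename`, the `podcasts` folder name, and the streaming loop are my own additions, not something prescribed by the API:

```python
import os
import re

import requests

def safe_filename(name):
    """Turn an episode header into a filename-safe mp3 name."""
    return re.sub(r'[^\w\-. ]', '_', name) + '.mp3'

def download_all(all_mp3, dest='podcasts'):
    """Stream every URL in the {url: header} dict to disk."""
    os.makedirs(dest, exist_ok=True)
    for url, header in all_mp3.items():
        path = os.path.join(dest, safe_filename(header))
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(path, 'wb') as f:
                # write in chunks so large files don't sit in memory
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
```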
This is an easy one, I am sure. I am parsing a website and trying to get the specific text between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all the inner-text results, but I have not found a good solution on the web for matching specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'Active', 'Default'])
This will select only those <a …> tags that have one of given strings as their text.
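On a made-up fragment it behaves like this. Note the match is exact and case-sensitive, so the list should mirror the page's capitalisation (shown with the modern bs4 import rather than the BeautifulSoup 3 one used in the question):

```python
from bs4 import BeautifulSoup

# Invented fragment covering the possible statuses
html = """
<a href="#1">Active</a>
<a href="#2">revoked</a>
<a href="#3">Pending</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only anchors whose entire text is one of the listed strings match
matches = soup.find_all("a", string=["revoked", "Active", "Default"])
print([m.text for m in matches])  # -> ['Active', 'revoked']
```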
I've used the snippet below in a similar occasion. See if this works with your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)