Basic Python webscrape script - python

This is my first Python web scraping attempt.
I have an IP camera that saves all of its files to an HTML document over HTTP. Essentially the camera is its own server that can be accessed over HTTP. The HTML within the server is very basic. It only includes a single body tag, which contains all of it's clips within this body tag. The files look like:
MP_2018-04-23_11-14-04_60.mov
I am wanting to list/print these files without all of the other HTML associated to it.
import bs4 as bs
import urlib.request
sauce = urllib.request.urlopen('http://192.168.1.99/form/getStorageFileList').read()
soup = bs.BeautifulSoup(sauce,'lxml')
body = soup.body
for paragraph in body.find_all('b'):
print(body.text)
I've included a few screenshots below as the error I am receiving is very lengthy. I am basically getting:
attribute error: module 'html5lib.treebuilders' has no attribute '_base'
Would someone clarify and possibly point me in the right direction?
usr/lib/python3/dist-packages/bs4/builder/_html5lib.py in <module>()
68
69
---> 70 class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
71
72 def __init__(self, soup, namespaceHTMLElements):
AttributeError: module 'html5lib.treebuilders' has no attribute '_base'
CameraHTML
Jupyterscript
JupyterscriptOutput

There were a few errors in your script. No big deal though. Also, you might get more benefit from using the Requests library. What about something like this?
from bs4 import BeautifulSoup as bs
import requests
sauce = requests.get('http://192.168.1.99/form/getStorageFileList')
page = sauce.text #Converted page to text
soup = bs(page,'html.parser') #Changed to 'html.parser'
body = soup.body('body') #Added the 'body' tag
for paragraph in body.find_all('b'):
print(paragraph.text) #Grabbed the iterated items & converted them to text
Let me know if this is something you are looking for.

Related

Web scraping youtube page

i'm trying to get the title of youtube videos given a link.
But i'm unable to access the element that hold the title. I'm using bs4 to parse the html.
I noticed im unable to access any element that is within 'ytd-app' tag in the youtube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can i get the video title ? Is there something i'm doing wrong or youtube intentionally created a tag like this to prevent web_scraping ?
See class that you are using is render through Javascript and all the contents are dynamic so it is very difficult to find any data using bs4
So what you can do find data in soup by manually and find particular tag
Also you can try out with pytube
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
soup.find("title").get_text()

Extracting data from Airtasker using beautiful Soup

I am trying to extract data from this website - https://www.airtasker.com/users/brad-n-11346775/.
So far, I have managed to extract everything except the license number. The problem I'm facing is bizarre as the license number is in the form of text. I was able to extract everything else like the Name, Address etc. For example, to extract the Name, I just did this:
name.append(pro.find('div', class_= 'name').text)
And it works just fine.
This is what I have tried to do, but I'm getting the output as None
license_number.append(pro.find('div', class_= 'sub-text'))
When I do :
license_number.append(pro.find('div', class_= 'sub-text').text)
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'text'
That means it does not recognise the license number as a text, even though it is a text.
Can someone please give me a workable solution and please tell me what am I doing wrong???
Regards,
The badge with the license number is added to the HTML dynamically from a Boostrap JSON that sits in one of the <script> tags.
You can find the tag with bs4 and scoop out the data with regex and parse it with json.
Here's how:
import ast
import json
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
scripts = BeautifulSoup(page, "lxml").find_all("script")[-4]
bootstrap_JSON = json.loads(
ast.literal_eval(re.search(r"parse\((.*)\)", scripts.string).group(1))
)
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])
Output:
Licence No. 28661

Unable to extract text from a specific span tag using Python's Beautiful Soup

I am currently scraping this website to build a car dataset and I have an equation built to loop through each page of the website while scraping. However, I am unable to extract the text I need to make this work.
The below code snippet is the tag that I am trying to scrape. I need to get the number of vehicles on the site.
<span class="d-none d-sm-inline">166 Vehicles</span>
This image shows the site's element that I am trying to scrape
Below is my code that I am using to scrape that element:
# Packages
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
print("Started web scrape...")
limit = 10
start = 0 #increment by limit
website = requests.get(f'https://www.sosubaru.com/new-inventory/index.htm?start={start}')
soup = BeautifulSoup(website.text, 'html.parser')
inventory_count = soup.select("span.d-none.d-sm-inline")[0].string
print(inventory_count)
This code returns the following:
Started OR_GP_Roe_Motors web scrape...
Traceback (most recent call last):
File "c:/mypath...", line 16, in <module>
inventory_count = soup.select("span.d-none.d-sm-inline")[0].string
IndexError: list index out of range
Then I checked to see why I was getting that error code by returning everything that the soup.select gave me:
inventory_count = soup.select("span.d-none.d-sm-inline")
print(inventory_count)
which returned:
Started web scrape...
[]
Why is it giving me an empty list?
I then told it to print out every span tag on the website to see if it was there. The result printed out many span tags but didn't include the one I am looking for. Why can't I detect it with beautiful soup? Is it the parser I am using? I tried using 'lxml' as the parser but it didn't change anything. Does it have anything to do with the fact that the website is an html xmls doc?
I have already scraped a few websites and haven't had any problems like this until now.
The data and tag you want don't appear in the html source, which means they are being added by javascript. You can either use selenium to get the page source after it has been rendered or you can use requests_html, which has an API similar to BeautifulSoup and it has the option to render a page's javascript before scraping it.
from requests_html import HTMLSession
s = HTMLSession()
r = s.get(url)
r.html.render()
r.find . . . [whatever you want to search for]

Scraping data from span tag with class using BS

Im trying to scrape project's names from Gitlab. When I inspect source code I see that name of project is in:
<span class='project-name'>Project Name</span>
Unfortunately, when I try to scrape this date I got empty list, My code looks like:
url = 'https://gitlab.com/users/USER/projects'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source,'lxml')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
I was trying other solutions like using attrs, class_ or using other HTML tags, but nothing works. What can be wrong here?
Ok, so it looks like when you inspect the page in the network tabs in chrome developer tools, you can see that projects are not rendered when request is made:
What that means is, project information is requested after. In order to get the projects you need to send a request to https://gitlab.com/users/USER/projects.json endpoint:
After that you can inspect the response from that endpoint. As you can see the response here is json so we can load json data with json module and then in that dictionary there is an entry called html which has html data in it, so we can parse it with beautifulsoup and the rest of the code stays the same:
import bs4 as bs
import urllib, json
url = 'https://gitlab.com/users/USER/projects.json'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(json.loads(source)["html"],'html.parser')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
print(repos)
Output:
['freebsd', 'freebsd-ports', 'freebsd-test', 'risc-vhdl', 'dotfiles', 'tideyBot']

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script and my problem is that I am not sure how to get the right parts from the code behind page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
ULR = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom-container-root is an empty variable, despite it is a direct child under ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
html = open("small.html").read()
soup = BeautifulSoup(html)
print soup.head.next_element
print soup.head.next_element.next_element

Categories