Get URL from BeautifulSoup object - python

Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
    url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.

If the url variable is a string of an actual URL, then you should just forget the BeautifulSoup here and use the same variable url. You should be using BeautifulSoup to parse HTML code, not a simple URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
If, on the other hand, url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link depends on how it appears in that HTML. Doing a find to get the first a tag and then extracting its href value would be one way.
>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'

Related

Python HTML parsing: removing excess HTML from get request output

I want to make a simple Python script to automate pulling .mov files from an IP camera's SD card. This model of IP camera supports HTTP requests that return HTML containing the .mov file info. My Python script so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT:
NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to only return the MOV file. So removing:
"NAME2041=Record_continiously/2018-06-02/8/"
I'm new to HTML parsing with python so I'm a bit confused with the functionality.
Is the returned HTML considered a string? If so, I understand that it is immutable and I will have to create a new string instead of "stripping away" part of the existing one.
I have tried:
page.replace("NAME2041=Record_continiously/2018-06-02/8/","")
in which I receive an attribute error. Is anyone aware of any method that could accomplish this?
Here is a sample of the HTML I am working with...
<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
</body>
</html>
Use str.split with negative indexing.
Ex:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print(page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov
As you asked for an explanation of your code, here it is:
# import statements
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")  # returns a Response object
soup = BeautifulSoup(page.content, 'html.parser')  # parses the response body
page.content returns the body of the response as bytes (page.text returns it decoded as a string). Because page is a Response object rather than a string, page.replace(...) raises an AttributeError.
You are passing that content (page.content) to the BeautifulSoup class, which is initialized with two arguments: the markup and the parser, here html.parser.
soup is the resulting BeautifulSoup object.
.prettify() is a method that pretty-prints the content.
String slicing can give wrong results because the length of the content varies, so it's better to split the content as suggested by @Rakesh; that's the best approach in your case.
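To pull every file name out of the listing rather than a single line, the split approach can be applied to each NAME entry. This is a minimal sketch that assumes the body looks like the sample in the question; the helper name extract_mov_names and the NAME<n>=<path> pattern are illustrative guesses, not part of any documented camera API.

```python
import re

# Sample body text copied from the question's HTML
body = """000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
"""

def extract_mov_names(text):
    """Return the bare file name of every NAME<n>=<path> entry (hypothetical helper)."""
    # \S+ captures the path up to the next whitespace, i.e. before SIZE<n>=
    paths = re.findall(r"NAME\d+=(\S+)", text)
    # Keep only the part after the last slash, as in the split answer above
    return [path.split("/")[-1] for path in paths]

names = extract_mov_names(body)
print(names)
```

With the real page, the same function could probably be fed page.text or soup.get_text() instead of the hard-coded sample.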

Python BS4 with SDMX

I would like to retrieve data given in an SDMX file (like https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried to use BeautifulSoup, but it does not seem to see the tags. Here is the code:
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me an empty object.
Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance
soup.findAll("bbk:series") would return the result.
In fact, even if you use lxml as the parser, BeautifulSoup still parses the document as HTML. Since HTML tags are case-insensitive, BeautifulSoup lowercases all the tag names, so soup.findAll("bbk:series") works. See "Other parser problems" in the official documentation.
If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup supports. Now you can use ts_series = soup.findAll("Series") to get the result, as BeautifulSoup strips the namespace prefix bbk.
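As a point of comparison, and not what the answer above uses, the standard library's xml.etree.ElementTree also treats the document as real XML, so tag names stay case-sensitive and the namespace prefix has to be mapped explicitly. The sketch below runs on a made-up, heavily simplified stand-in for an SDMX file; the namespace URI and element layout are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an SDMX document; the namespace URI is made up.
xml_source = """<root xmlns:bbk="http://example.com/bbk">
  <bbk:Series ID="A"><bbk:Obs VALUE="1.5"/></bbk:Series>
  <bbk:Series ID="B"><bbk:Obs VALUE="2.5"/></bbk:Series>
</root>"""

root = ET.fromstring(xml_source)
ns = {"bbk": "http://example.com/bbk"}  # prefix -> URI mapping used in queries

# XML tag names are case-sensitive, so "bbk:Series" must match exactly
series = root.findall("bbk:Series", ns)
series_ids = [s.get("ID") for s in series]

# The lowercased name that works with an HTML parser finds nothing here
lowercase_matches = root.findall("bbk:series", ns)
print(series_ids, lowercase_matches)
```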

Opening webpage and returning a dict of all the links and their text

I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this mysoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're testing whether a regular string (str) is contained in a bytes object. If you add print(line) as the first statement in your for loop, you'll see that it prints a line of HTML with a b' at the beginning, indicating that it is a bytes object that has not been decoded to text yet. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using regex to get both the html and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips <a> tags without an href, which would otherwise raise KeyError
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)
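If third-party packages are off the table for the assignment, the standard library's html.parser module can build the same dictionary. This is only a sketch under that assumption; the class name LinkCollector and the helper links_from_html are made up for illustration, and the sample HTML is a stand-in for a real page.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect {href: link text} for every <a href="..."> in a document."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._current_href = None  # href of the <a> we are inside, if any
        self._text_parts = []      # text seen since that <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = href
                self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links[self._current_href] = "".join(self._text_parts).strip()
            self._current_href = None

def links_from_html(html_text):
    """Parse a str of HTML (already decoded, not bytes) into a link dict."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links

sample = ('<p><a href="http://my.computer.com/some/file.html">link text</a>'
          ' and <a name="x">no href</a></p>')
links = links_from_html(sample)
print(links)
```

For a live page, the bytes returned by urllib.request.urlopen(url).read() would need a .decode('utf-8') before being passed in, which is the same fix that resolves the TypeError in the question.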

BeautifulSoup4 find all tags with attribute begins with a string in Python

How can I use beautifulsoup to find all tags with attributes that begins with some string?
The following doesn't seem to work :(
soup.find_all('a', {'href':re.compile('^com')})
It seems to work as expected. I think it doesn't work in your case because your example is wrong: an href attribute normally doesn't begin with com; it usually begins with either http or https.
Running your example against your own question, it works as expected:
import requests
from bs4 import BeautifulSoup
import re
html = requests.get("http://stackoverflow.com/questions/24416106/beautifulsoup4-find-all-tags-with-attribute-begins-with-a-string-in-python")
soup = BeautifulSoup(html.text, 'html.parser')
http = soup.find('a', {'href': re.compile('^http')})
print(http)
Produces:
<a data-gps-track="site_switcher.click({ item_type:6 })" href="http://chat.stackoverflow.com">chat</a>
And if you replace ^http with ^https, you'll get an a tag with an href that begins with https.
Note: I used the find() method for simplicity
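The same filter can be tried offline against a small hand-written document, which makes it easier to see which anchors a given prefix selects. This is a sketch only; the sample HTML and the helper name hrefs_starting_with are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hand-written sample markup (not taken from any real page)
html = """
<a href="http://example.com/a">first</a>
<a href="https://example.com/b">second</a>
<a href="com/relative">third</a>
<a name="no-href">fourth</a>
"""
soup = BeautifulSoup(html, "html.parser")

def hrefs_starting_with(soup, prefix):
    """Return the href of every <a> whose href begins with prefix."""
    # ^ anchors the match at the start; re.escape guards special characters
    pattern = re.compile("^" + re.escape(prefix))
    return [a["href"] for a in soup.find_all("a", href=pattern)]

http_links = hrefs_starting_with(soup, "http")
com_links = hrefs_starting_with(soup, "com")
print(http_links, com_links)
```

Note that the http prefix also selects https links, and that the anchor without an href is skipped rather than raising an error.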

Return last URL in sequence of redirects

I sometimes need to parse with Beautiful Soup and Requests URLs that are provided as such:
http://bit.ly/sdflksdfwefwe
http://stup.id/sdfslkjsfsd
http://0.r.msn.com/sdflksdflsdj
Of course, these URLs generally 'resolve' to a canonical URL such as http://real-website.com/page.html. How can I get the last URL in the resolution / redirect chain?
My code generally looks like this:
from bs4 import BeautifulSoup
import requests
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
canonical_url = response.??? ## This is what I need to know
Note that I don't mean to query http://bit.ly/bllsht to see where it goes, but rather when I am using Beautiful Soup to already parse the page that it returns, to also get the canonical URL that was the last in the redirect chain.
Thanks.
It's in the url attribute of your response object.
>>> response = requests.get('http://bit.ly/bllsht')
>>> response.url
u'http://www.thenews.org/sports/well-hey-there-murray-state-1-21-11-1.2436937'
You could easily find this information in the “Quick Start” page.
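If the intermediate hops matter too, requests keeps them in response.history, a list of the earlier Response objects in the order they were followed. The sketch below uses a stand-in object instead of a live request so that it runs offline; the helper name redirect_chain and the stand-in values are assumptions for illustration.

```python
from types import SimpleNamespace

def redirect_chain(response):
    """Return every URL visited, ending with the final one (hypothetical helper)."""
    # response.history holds the redirect responses; response.url is the last stop
    return [r.url for r in response.history] + [response.url]

# Stand-in for a real requests.Response, so the sketch needs no network access
fake_response = SimpleNamespace(
    url="http://real-website.com/page.html",
    history=[SimpleNamespace(url="http://bit.ly/sdflksdfwefwe")],
)

chain = redirect_chain(fake_response)
print(chain[-1])
```

With a real call, passing the result of requests.get(url) to redirect_chain should work the same way, since redirects are followed by default for GET requests.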
