Extract URL using beautifulsoup 4

I'm trying to extract a URL using BS4. I can get to the correct location, but I'm not sure how to remove the surrounding tag from the URL. I tried adding .text, but this returned nothing.
vid_screenshot = (soup('a', {'class':'mp4Thumb'}))[0].contents[0]
>> <img src="www.fgfg.com/dsfasdf.jpg"/>
Desired result:
>> www.fgfg.com/dsfasdf.jpg
This did not work; it returned nothing:
(soup('a', {'class':'mp4Thumb'}))[0].contents[0].text
Would anyone know how to strip these tags?

You have the HTML tag, you need to take the src attribute:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot['src']
This assumes there is always going to be a src attribute on the tag. You can also use the .get() method to return None if the attribute is not present:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src')
or you can give .get() a second argument to return if the attribute is missing:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src', 'default value')
See the Attributes documentation section.
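
The attribute access described above can be tried on a small, self-contained snippet. This is only a sketch: the markup is a hypothetical stand-in modelled on the question (the real page is not shown), and it looks the img tag up by name rather than relying on contents[0].

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page is not shown.
html = '<a class="mp4Thumb" href="/video"><img src="www.fgfg.com/dsfasdf.jpg"/></a>'
soup = BeautifulSoup(html, "html.parser")

# Find the link, then the <img> inside it, guarding against missing tags.
link = soup.find("a", class_="mp4Thumb")
img = link.find("img") if link is not None else None

# .get() returns None (or the supplied default) when the attribute is absent.
src = img.get("src") if img is not None else None
alt = img.get("alt", "no alt text") if img is not None else None

print(src)
print(alt)
```

Looking the tag up by name avoids picking up a whitespace string when the markup has a line break before the img tag.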

Related

BeautifulSoup doesn't return href, it returns None

>>> soup_brand
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
>>> type(soup_brand)
<class 'bs4.BeautifulSoup'>
>>> print(soup_brand.get('href'))
None
Documentation followed: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Hi people from all over the world,
does someone know what's going wrong, or am I targeting the object wrong?
I need to get the href.

Have you tried:
soup.find_all(name="a")
or
soup.select_one(selector="a")
It should also be possible to catch it with:
all_anchor_tags = soup.find_all(name="a")
for tag in all_anchor_tags:
    print(tag.get("href"))  # prints the href attribute of each a tag, thus each link
Although find_all looks for multiple elements (which is why we have a loop here), I have found that bs4 is sometimes better at catching things if you search for all matches and then iterate over the elements.

In order to apply ['href'], the object must be a bs4.element.Tag.
So, try this:
from bs4 import BeautifulSoup
string = """
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
"""
s = BeautifulSoup(string, "html.parser")
a_tag = s.find('a')
print(a_tag["href"])
Output:
/URL/somename
Or, if you have multiple a tags, you can try this:
a_tags = s.find_all('a')
for a in a_tags:
    print(a.get("href"))
Output:
/URL/somename
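
To make the difference between the document object and the tag concrete, here is a small sketch. It assumes the markup from the question was parsed on its own, so that soup_brand is the whole document rather than the a tag; the html.parser choice is also an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<a data-role="BRAND" href="/URL/somename">Some Name</a>'
soup_brand = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object stands for the whole document, so it has no href.
doc_href = soup_brand.get("href")

# The <a> element is a child Tag inside that document.
a_tag = soup_brand.find("a")
tag_href = a_tag.get("href") if isinstance(a_tag, Tag) else None

print(doc_href)  # None
print(tag_href)  # /URL/somename
```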

String argument behaves differently within my script

I've tried to parse a text out of some html elements using string argument the way it is described here but failed miserably. I've tried two different ways but every time I encountered the same AttributeError.
How can I use string argument in this very case to fetch the text?
I've tried with:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
try:
    item = soup.find("caption",string="ASIC registration").text
    #item = soup.find("caption",string=re.compile("ASIC registration",re.I)).text
except AttributeError:
    item = ""
print(item)
Expected output (only using string argument):
ASIC registration

How can I use string argument in this very case to fetch the text?
You can't
Note:
I am assuming that you mean the string parameter in
item = soup.find("caption",string="ASIC registration").text
As given in the documentation
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.find("caption")
print(item.string)
Output
None
Here the .string is None as caption has more than one child.
If you are trying to get the parent (caption tag in this case) with the text, you could do
item = soup.find(string=re.compile('ASIC registration')).parent
which will give
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
Of course, calling .text on this parent tag will give the full text within that tag, even if the matched string is only part of it.
item = soup.find(string=re.compile('ASIC')).parent.text
will give an output
ASIC registration

The issue you're running into is that the string argument searches for strings instead of for tags as it states in the documentation you linked.
The syntax you are using:
soup.find("caption",string="ASIC registration")
is for finding tags.
For finding strings:
soup.find(string=re.compile('ASIC'))
With the first one you are saying: find a caption tag whose .string matches your string. The caption tag's .string is None (it has more than one child), so nothing is returned.
The second one is saying find the string that contains 'ASIC', so it returns the string.

Turns out the string parameter doesn't work if a tag has a child tag. The following code is stupid, but it works:
real_item = ""
try:
    items = soup.find_all("caption")
    r = re.compile("ASIC registration", re.I)
    for item in items:
        for s in item.strings:
            if r.search(str(s)):  # str() on Python 3; unicode() on Python 2
                real_item = item
                break
except AttributeError:
    real_item = ""
print(real_item)
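
As an alternative to looping over the strings by hand, a function can be passed to find() so the match is made against the tag's full text. Note that this sketch does not use the string argument at all; the helper name is_asic_caption is made up for illustration, and html.parser is used instead of lxml only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(html, "html.parser")


def is_asic_caption(tag):
    # Compare against the tag's full text, so child tags don't matter.
    return tag.name == "caption" and "ASIC registration" in tag.get_text()


caption = soup.find(is_asic_caption)
# strip=True drops the whitespace-only strings around the text.
item = caption.get_text(strip=True) if caption is not None else ""
print(item)
```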

Python - Retrieve specific "object" from url with BeautifulSoup

I'm trying to parse a specific "item" on a site, but I don't know if it's a class, object, id or something else.
My code:
soup = BeautifulSoup(urllib2.urlopen(myURL))
divdata = soup.find('div')
print(divdata)
And it returns:
<div data-store='{"Auth":{"cookie":null,"user":null,"timestamp":1485297666762},"Blocked":{},"Broadcast":
{"forceUpdate":false,"failed":[],"pending":[],"error":
{"isNotFound":false,"isServerError":false,"isUnavailable":false}},"BroadcastCache":{"broadcasts":{"ID1":{"broadcast":
{"data":{"class_name":"Broadcast","id":"ID1","state":"running,
....(more)....
So I want to retrieve "running", or whatever is in "state".
I tried
statedata = soup.find('div', {"class":"state"})
But it returns nothing, what is the correct way to retrieve it?

import json
div_tag = soup.find('div', {'data-store':True})
data_string = div_tag['data-store'] # get data string
json.loads(data_string)['BroadcastCache']['broadcasts']['ID1']['broadcast']['data']['state'] # convert data string to python dict and get state
out:
'running'
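
The same idea can be sketched without hard-coding the broadcast ID. The markup below is a hypothetical, heavily shortened stand-in for the page's data-store attribute (the real value is truncated in the question), so the key layout is an assumption based on the visible part.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real data-store attribute.
html = (
    "<div data-store='{\"BroadcastCache\":{\"broadcasts\":{\"ID1\":"
    "{\"broadcast\":{\"data\":{\"id\":\"ID1\",\"state\":\"running\"}}}}}}'></div>"
)

soup = BeautifulSoup(html, "html.parser")

# Match any div that carries a data-store attribute, whatever its value.
div_tag = soup.find("div", attrs={"data-store": True})
store = json.loads(div_tag["data-store"])

# Collect the state of every broadcast instead of indexing "ID1" directly.
states = {
    broadcast_id: entry["broadcast"]["data"]["state"]
    for broadcast_id, entry in store["BroadcastCache"]["broadcasts"].items()
}
print(states)
```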

You can also write the class search as soup.find_all('div', class_='state').
Note the underscore after class_.
It's unlikely to work in your case without modification, since data-store is an attribute of the div and the rest is just a string inside that attribute, not the content of a tag. You could just use str.find('"state"') on that attribute value.

media:thumbnail w/ BeautifulSoup

What is the correct way to parse the url attribute of media:thumbnail tags using BeautifulSoup? I have tried the following:
doc = BeautifulSoup(urlopen('http://rss.cnn.com/rss/edition.rss'), 'xml')
items = doc.findAll('item')
for item in items:
    title = item.title.text
    link = item.link.text
    image = item.find('media:thumbnail')[0]['url']
However, I get the 'NoneType' object is not subscriptable error.

Don't include the namespace prefix:
>>> doc.find('thumbnail')
<media:thumbnail height="51" url="http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg" width="90"/>
The element.find() method returns one element, so there is no need for indexing here; you can access the url attribute on the element directly:
>>> doc.find('thumbnail')['url']
u'http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg'
There currently isn't any support for searching by a specific namespace; the namespace URL is stored (in the .namespace attribute) but not used by .find() or .find_all().

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages; one worked and the others gave the same error (I removed spaces). Can you explain what "name" refers to and why this error happens? Thanks.

Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
response = requests.get(url)  # url: the address of the page you are parsing
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)

name refers to the name of the tag if the object is a Tag object (i.e. for <html>, name = "html").
If you have whitespace in your markup between nodes, BeautifulSoup turns it into NavigableStrings. So if you use the index of the contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree
Or, if you know the name of the next tag you would like, you can use that name as the property and it will return the first Tag with that name, or None if no children with that name exist: Using Tag Names as Members
If you want to use the contents, you have to check the objects you are working with. The error you are getting just means the code is accessing the name property on a NavigableString because it assumes every child is a Tag.
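
The points above can be seen on a tiny document. This is a minimal sketch with made-up markup; it uses html.parser, which keeps the whitespace between tags as strings.

```python
from bs4 import BeautifulSoup

# Made-up markup with line breaks between the child tags.
html = "<body>\n<p>first</p>\n<div>second</div>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .contents mixes whitespace strings and tags.
kinds = [type(child).__name__ for child in soup.body.contents]
print(kinds)

# Querying for tags skips the strings; recursive=False keeps it to direct children.
tag_names = [child.name for child in soup.body.find_all(True, recursive=False)]
print(tag_names)

# Using a tag name as a member returns the first matching Tag, or None.
first_div = soup.body.div
missing = soup.body.span
print(first_div.name, missing)
```

Indexing soup.body.contents[0] here would give the leading line break rather than the p tag, which is the situation that leads to the error in the question.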

You can use try/except to skip the cases where a NavigableString is encountered in the loop, like this:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass

This is the latest working code to obtain the names of the tags in the soup:
import requests
from bs4 import BeautifulSoup, Tag
res = requests.get(url).content  # url: the address of the page you are parsing
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    if isinstance(child, Tag):
        print(child.name)
