I need to create an <img /> tag. This is the code I use to create the image tag with BeautifulSoup:
soup = BeautifulSoup(text, "html5")
tag = Tag(soup, name='img')
tag.attrs = {'src': '/some/url/here'}
text = soup.renderContents()
print text
Output: <img src="/some/url/here"></img>
How can I make it produce a self-closing tag instead: <img src="/some/url/here" />
It could of course be done with a regex or similar trickery, but I was wondering whether there is a standard way to produce tags like this?
Don't use Tag() to create new elements. Use the soup.new_tag() method:
soup = BeautifulSoup(text, "html5")
new_tag = soup.new_tag('img', src='/some/url/here')
some_element.append(new_tag)
The soup.new_tag() method will pass along the correct builder to the Tag() object, and it is the builder that is responsible for recognising <img/> as an empty tag.
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div></div>', "html5")
>>> new_tag = soup.new_tag('img', src='/some/url/here')
>>> new_tag
<img src="/some/url/here"/>
>>> soup.div.append(new_tag)
>>> print soup.prettify()
<html>
<head>
</head>
<body>
<div>
<img src="/some/url/here"/>
</div>
</body>
</html>
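One caveat worth noting with soup.new_tag(): attribute names that are Python keywords (like class) or that contain dashes (like data-src) cannot be passed as plain keyword arguments, but you can unpack a dict instead. A minimal sketch, assuming html.parser and hypothetical attribute values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div></div>', 'html.parser')

# 'class' is a reserved word and 'data-src' is not a valid identifier,
# so neither can be written as new_tag(class=..., data-src=...);
# unpacking a dict works for both.
img = soup.new_tag('img', **{'class': 'avatar', 'data-src': '/some/url/here'})
soup.div.append(img)
print(soup.div)
```

Because html.parser knows img is a void element, the tag is still rendered self-closing.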
In BS4 you can also do this:
img = BeautifulSoup('<img src="/some/url/here" />', 'lxml').img
print(img)
print(type(img))
which will output:
<img src="/some/url/here"/>
<class 'bs4.element.Tag'>
The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear() you remove the tag's contents.
To leave the soup untouched, you can either make a hard copy with copy.copy() or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: two DIY approaches that avoid copy:
use the Tag class
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
use strings
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))
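The first DIY variant can be put together as a runnable sketch; this is an assumption-level illustration using a bare Tag constructed outside any parse tree, with the same sample HTML as above:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div name='tag-i-want'><span>I don't want this</span></div>",
                     'html.parser')
div = soup.find('div')

# Build a fresh, empty tag that borrows only the name and attributes;
# the original soup is never touched.
div_only = Tag(name=div.name, attrs=div.attrs)
print(div_only)
print(soup.find('span') is not None)  # True -- the soup is intact
```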
#cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
This is a snippet of the HTML I have:
<body>
<form method="post" action="./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I" id="aspnetForm" autocomplete="off">
<div>
I would like to extract this value:
./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I
from the HTML.
I currently have this, but I am unsure how to proceed:
parsed_html = BeautifulSoup(html, 'lxml')
a = parsed_html.body.find('div', attrs={'form method':'post'})
print (a)
import re
r = re.compile('action="\S+"')
r.match(line)
line[r.start():r.end()].split("=")
Here is something you can try:
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup('<body><form method="post" name="mainForm" action="./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I" id="aspnetForm" autocomplete="off"></body>')
>>> s.find("form", {"name":"mainForm"})['action']
u'./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I'
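The same lookup can also be written as a CSS selector with select_one(), which takes an attribute selector. A small sketch, using a hypothetical minimal form as a stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal form standing in for the real page.
html = ('<body><form method="post" name="mainForm" '
        'action="./pagam.aspx?a=1&b=XYZ"></form></body>')
soup = BeautifulSoup(html, 'html.parser')

# The CSS attribute selector [name="mainForm"] matches the form,
# and the tag's attributes are indexed like a dict.
action = soup.select_one('form[name="mainForm"]')['action']
print(action)  # ./pagam.aspx?a=1&b=XYZ
```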
I would like to get the wrapper element of a given text. For example, in HTML:
…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…
Based on the text "chicken", I would like to get back <div class="target">chicken</div>.
Currently, I have the following to fetch the HTML:
import requests
from bs4 import BeautifulSoup
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')
And then I just do soup.find_all('div', …) and loop through all available divs to find the wrapper I am looking for.
What would be the proper and most efficient way of fetching the wrapper based on a defined text, without having to loop through every div?
Thank you in advance and will be sure to accept/upvote answer!
# coding: utf-8
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> Last chicken leg on stock! Only 500$ !!! </title>
</head>
<body>
<div id="layer1" class="class1">
<div id="layer2" class="class2">
<div id="layer3" class="class3">
<div id="layer4" class="class4">
<div id="layer5" class="class5">
<p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
<div id="layer6" class="class6">
<div id="layer7" class="class7">
<div id="chicken_surname" class="chicken">eat me</div>
<div id="layer8" class="class8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>"""
from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")
# (tag -> text) direction is pretty obvious that way
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)
# but it can be tricky when you need to find a tag by a substring
tag_by_str = soup.find(string="eat me")
tag_by_sub = soup.find(string="eat")
tag_by_resub = soup.find(string=re.compile("eat"))
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)
# there is more than one way to access the underlying strings
# both are different - see results
tag = soup.find('p')
print('\n###### .text attr:')
print( tag.text, type(tag.text) )
print('\n###### .strings generator:')
for s in tag.strings:  # .strings is a generator
    print(s, type(s))
# note that the .strings generator yields bs4.element.NavigableString elements
# so we can use them to navigate, for example accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)
# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
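Applying the same idea directly to the original question: find the NavigableString by its text, then step up to its parent to get the wrapper. A minimal sketch, assuming html.parser and the two-div example from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="target">chicken</div><div class="not-target">apple</div>'
soup = BeautifulSoup(html, 'html.parser')

# find(string=...) returns the matching NavigableString;
# its .parent is the tag that wraps the text.
wrapper = soup.find(string="chicken").parent
print(wrapper)  # <div class="target">chicken</div>
```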
I'm trying to extract an image source url from a HTML img tag.
if html data is like below:
<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>
or
<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>
What would the regex be in Python?
I had tried below:
i = re.compile('(?P<src>src=[["[^"]+"][\'[^\']+\']])')
i.search(htmldata)
but I got an error
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
BeautifulSoup parser is the way to go.
>>> from bs4 import BeautifulSoup
>>> s = '''<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>'''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> img = soup.select('img')
>>> [i['src'] for i in img if i['src']]
[u'http://domain.com/profile.jpg']
>>>
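One note on the list comprehension above: i['src'] raises a KeyError for an img that has no src attribute. A safer variant, sketched here with a hypothetical src-less image added to the sample, filters in find_all() instead:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one src-less image added.
html = '<div><img src="http://domain.com/profile.jpg"><img alt="no source"></div>'
soup = BeautifulSoup(html, 'html.parser')

# src=True matches only tags that actually carry a src attribute,
# so the comprehension can never hit a KeyError.
srcs = [img['src'] for img in soup.find_all('img', src=True)]
print(srcs)  # ['http://domain.com/profile.jpg']
```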
I adapted your code a little bit. Please take a look:
import re
url = """<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>"""
ur11 = """<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>"""
link = re.compile("""src=[\"\'](.+)[\"\']""")
links = link.finditer(url)
for l in links:
    print l.group()
    print l.groups()
links1 = link.finditer(ur11)
for l in links1:
    print l.groups()
In l.groups() you can find the link.
The output is this:
src="http://domain.com/profile.jpg"
('http://domain.com/profile.jpg',)
('http://domain.com/profile.jpg',)
finditer() is a generator and allows using a for in loop.
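One caveat with this pattern: the greedy (.+) backtracks to the last quote on the line, so it overshoots whenever another quoted attribute follows src=. A non-greedy (.+?) stops at the first closing quote. A small sketch on a hypothetical tag with a trailing alt attribute:

```python
import re

html = '<img src="http://domain.com/profile.jpg" alt="me">'

# Greedy: (.+) runs to the end and backtracks only to the *last* quote.
greedy = re.search(r"""src=["'](.+)["']""", html)
# Non-greedy: (.+?) stops at the *first* quote after the opening one.
lazy = re.search(r"""src=["'](.+?)["']""", html)

print(greedy.group(1))  # http://domain.com/profile.jpg" alt="me
print(lazy.group(1))    # http://domain.com/profile.jpg
```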
Sources:
http://www.tutorialspoint.com/python/python_reg_expressions.htm
https://docs.python.org/2/howto/regex.html
The code below should generate a list containing all the h1 elements that have the class fluid, but it returns an empty list. I cannot find the mistake; can anyone help me?
allh1= soup.findAll('h1')
classes = [ h1.get('class') for h1 in allh1]
fluid_list = []
for item in classes:
    if item == 'fluid':
        fluid_list.append(item)
print fluid_list
Your code doesn't work because classes is a list of lists, holding the classes of each h1 found (or None where there is no class attribute):
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
</head>
<body>
<h1>header 1</h1>
<h1 class="fluid">header 2</h1>
<h1>header 3</h1>
<h1 class="fluid static">header 4</h1>
</body>
</html>
"""
soup = BeautifulSoup(html_doc)
allh1= soup.findAll('h1')
classes = [ h1.get('class') for h1 in allh1]
print(classes)
[None, ['fluid'], None, ['fluid', 'static']]
If you're using Beautiful Soup 4.1.2+, you can instead filter by class directly with the class_ keyword argument:
fluid_list = soup.find_all('h1', class_='fluid')
print(fluid_list)
[<h1 class="fluid">header 2</h1>, <h1 class="fluid static">header 4</h1>]
This returns the h1 elements themselves, which I assume is what you want.
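The same filter can also be expressed as a CSS selector via select(); h1.fluid matches any h1 whose class list contains "fluid", including tags that carry additional classes. A minimal sketch with the same sample headers:

```python
from bs4 import BeautifulSoup

html = ('<h1>header 1</h1>'
        '<h1 class="fluid">header 2</h1>'
        '<h1>header 3</h1>'
        '<h1 class="fluid static">header 4</h1>')
soup = BeautifulSoup(html, 'html.parser')

# .fluid in a CSS selector tests class-list membership,
# so "fluid static" matches too.
fluid_list = soup.select('h1.fluid')
print(fluid_list)
```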