Parsing HTML value in Python

This is a snippet of the HTML I have:
<body>
<form method="post" action="./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I" id="aspnetForm" autocomplete="off">
<div>
I would like to extract this value:
./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I
from the HTML.
This is what I currently have, but I am unsure how to do it:
parsed_html = BeautifulSoup(html, 'lxml')
a = parsed_html.body.find('div', attrs={'form method':'post'})
print (a)

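import re
# Capture the value of the action attribute (everything between the quotes).
r = re.compile(r'action="([^"]+)"')
# search() scans the whole line; match() only succeeds at the start of it.
m = r.search(line)
if m:
    print(m.group(1))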

Here is something you can try :
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup('''<body>
... <form method="post" name="mainForm" action="./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I" id="aspnetForm" autocomplete="off"></body>''', 'html.parser')
>>> s.find("form", {"name":"mainForm"})
>>> s.find("form", {"name":"mainForm"})['action']
u'./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I'
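The question's HTML has no name attribute on the form, so here is a minimal sketch of the same lookup that assumes the form can be identified by its id="aspnetForm" instead. It also splits the query string with the standard library's urllib.parse. The variable names and the choice of 'html.parser' are only illustrative:

```python
from urllib.parse import parse_qs, urlsplit

from bs4 import BeautifulSoup

# The form from the question, closed so the snippet is self-contained.
html = (
    '<body><form method="post" '
    'action="./pagam.aspx?a=9095709&b=RkVsgP1UClEdbu0oUvc8pKDxd5OcslXk1xHlVhK7uuqH_7ZfaquNNa1VHgeSZWm9hAq4s7Thk6wIhoRsooDoMF7U2nzmVDDbRujlxaPTg8I" '
    'id="aspnetForm" autocomplete="off"><div></div></form></body>'
)

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before indexing.
form = soup.find("form", id="aspnetForm")
action = form["action"] if form is not None else None

# Optionally break the query string into its parameters.
params = parse_qs(urlsplit(action).query) if action else {}

print(action)
print(params.get("a"))
```

Indexing with form["action"] raises KeyError if the attribute is missing; form.get("action") returns None instead, which may be preferable on pages you do not control.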

Related

Using BeautifulSoup, how to select a tag without its children?

The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.

clear() removes a tag's contents.
To leave the soup unaltered, you can either work on a copy made with copy.copy() or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approach, without copy.
Use the Tag class:
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
Or build the string yourself:
div_only = '<div {}></div>'.format(' '.join(f'{k}="{v}"' for k, v in div.attrs.items()))

@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
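Another possible variant, sketched here under the assumption that only the tag name and attributes are needed: build a fresh, detached tag with soup.new_tag() instead of copying and clearing. The helper name without_children is made up for this illustration:

```python
from bs4 import BeautifulSoup

html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")


def without_children(tag, factory):
    """Return a new, detached tag with the same name and attributes."""
    # dict(...) copies the attribute mapping so the new tag does not share it
    # with the original tag.
    return factory.new_tag(tag.name, attrs=dict(tag.attrs))


divs = [str(without_children(d, soup)) for d in soup.find_all("div")]
print(divs)

# The original tree still contains the <span>.
print(soup.find("span") is not None)
```

Since the new tag is never attached to the tree, the soup stays untouched, which was the requirement in the question.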

URLs separation with bs4 and Python

I am scraping a site for a bunch of links. The links are all in a single HTML div tag, separated by <br /> tags, but when I try to get all the URLs from that div they come back as a single string.
I am unable to separate them into a list.
I'm scraping the links with the code below:
links = soup.find('div', id='dle-content').find('div', class_='full').find(
'div', class_='full-news').find('div', class_='quote').text
This is the HTML from the site:
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
The output I get from the code above:
https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/
The output I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]

Try this:
from bs4 import BeautifulSoup
sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""
soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])
Output:
[['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']]

You could fix it with a string manipulation:
new_output = ' http'.join(output.split('http')).split()

Split the string on 'http', and then use a list comprehension to put the URLs back together:
output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'
# Splitting on 'http' drops that prefix and leaves an empty first item.
split_output = output.split('http')
new_output = ['http' + x for x in split_output if x != '']
Output:
print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

Another way to achieve the desired output:
from bs4 import BeautifulSoup
html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""
soup = BeautifulSoup(html,"html.parser")
print([i.strip() for i in soup.find("div",class_="quote").strings if i!='\n'])
Output:
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']
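For comparison, a short sketch using the stripped_strings generator, which yields each text node with surrounding whitespace removed and skips the whitespace-only ones. It assumes the same sample HTML as above; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
quote = soup.find("div", class_="quote")

# stripped_strings strips every text node and drops the empty ones;
# the HTML comments are not included in the result.
urls = list(quote.stripped_strings) if quote is not None else []
print(urls)
```

This avoids both the manual strip() call and the comparison against '\n'.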

BeautifulSoup create a <img /> tag

I need to create a <img /> tag.
With the code I wrote, BeautifulSoup creates the image tag like this:
soup = BeautifulSoup(text, "html5")
tag = Tag(soup, name='img')
tag.attrs = {'src': '/some/url/here'}
text = soup.renderContents()
print text
Output: <img src="/some/url/here"></img>
How can I get this instead? <img src="/some/url/here" />
It could of course be done with a regex or similar trickery. However, I was wondering whether there is a standard way to produce tags like this.

Don't use Tag() to create new elements. Use the soup.new_tag() method:
soup = BeautifulSoup(text, "html5")
new_tag = soup.new_tag('img', src='/some/url/here')
some_element.append(new_tag)
The soup.new_tag() method will pass along the correct builder to the Tag() object, and it is the builder that is responsible for recognising <img/> as an empty tag.
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div></div>', "html5")
>>> new_tag = soup.new_tag('img', src='/some/url/here')
>>> new_tag
<img src="/some/url/here"/>
>>> soup.div.append(new_tag)
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <div>
   <img src="/some/url/here"/>
  </div>
 </body>
</html>

In BS4 you can also do this:
img = BeautifulSoup('<img src="/some/url/here" />', 'lxml').img
print(img)
print(type(img))
which will output:
<img src="/some/url/here"/>
<class 'bs4.element.Tag'>
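A small sketch of the soup.new_tag() approach with the built-in 'html.parser' (an assumption made here so that no extra parser needs to be installed). It shows that a void element such as img is rendered self-closing, while an ordinary element such as span gets a closing tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")

# new_tag() ties the tag to the soup's builder, which knows that
# <img> is a void (empty) element.
img = soup.new_tag("img", src="/some/url/here")
span = soup.new_tag("span")

soup.div.append(img)
soup.div.append(span)

rendered = str(soup)
print(rendered)
```

Note that BeautifulSoup writes the tag as <img .../> without a space before the slash, which is equivalent in HTML and XHTML.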

Python beautifulsoup extract date between <a> tag (and associate it with his own url) [duplicate]

I have an html code like this:
<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
<h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>
I need to extract the texts (link descriptions) between 'a' tags. I need an array to store these like:
a[0] = "My HomePage"
a[1] = "Sections"
I need to do this in python using BeautifulSoup.
Please help me, thank you!

You can do something like this:
from bs4 import BeautifulSoup
html = """
<html><head></head>
<body>
<h2 class='title'><a href='http://www.gurletins.com'>My HomePage</a></h2>
<h2 class='title'><a href='http://www.gurletins.com/sections'>Sections</a></h2>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print([elm.a.text for elm in soup.find_all('h2', {'class': 'title'})])
# Output: ['My HomePage', 'Sections']

print([a.findAll(text=True) for a in soup.findAll('a')])

The following code extracts text (link descriptions) between 'a' tags and stores in an array.
>>> from bs4 import BeautifulSoup
>>> data = """<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
... <h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqTxt = soup.find_all("h2", {"class":"title"})
>>> a = []
>>> for i in reqTxt:
... a.append(i.get_text())
...
>>> a
['My HomePage', 'Sections']
>>> a[0]
'My HomePage'
>>> a[1]
'Sections'
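Since the title also asks to associate each text with its own URL, here is a hedged sketch that collects (text, href) pairs. It assumes every h2 with class "title" wraps at most one link, as in the sample; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

data = """
<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
<h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>
"""

soup = BeautifulSoup(data, "html.parser")

links = []
for h2 in soup.find_all("h2", class_="title"):
    # href=True only matches <a> tags that actually have an href attribute.
    a = h2.find("a", href=True)
    if a is not None:
        links.append((a.get_text(strip=True), a["href"]))

print(links)
```

A dict built with dict(links) would work as well if the link texts are known to be unique.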

Matching partial ids in BeautifulSoup

I'm using BeautifulSoup. I have to find any reference to the <div> tags with id like: post-#.
For example:
<div id="post-45">...</div>
<div id="post-334">...</div>
I have tried:
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
print soupHandler.findAll('div', id='post-*')
How can I filter this?

You can pass a function to findAll:
>>> print(soupHandler.findAll('div', id=lambda x: x and x.startswith('post-')))
[<div id="post-45">...</div>, <div id="post-334">...</div>]
Or a regular expression:
>>> print(soupHandler.findAll('div', id=re.compile(r'^post-')))
[<div id="post-45">...</div>, <div id="post-334">...</div>]

Since the question asks for ids of the form "post-<number>", it is more precise to use:
import re
[...]
soupHandler.findAll('div', id=re.compile(r"^post-\d+"))

soupHandler.findAll('div', id=re.compile(r"^post-\d+$"))
looks right to me.

This works for me:
from bs4 import BeautifulSoup
import re
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html, 'html.parser')
for match in soupHandler.find_all('div', id=re.compile("post-")):
    print(match.get('id'))
Output:
post-45
post-334
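To show how the anchored pattern differs from the unanchored one, here is a sketch with two extra, made-up divs (id="post-x" and id="repost-7") that a plain "post-" search would also pick up. It additionally pulls out the numeric part of each id; the names are illustrative:

```python
import re

from bs4 import BeautifulSoup

# The last two divs are hypothetical, added only to show what gets excluded.
html = (
    '<div id="post-45">...</div> <div id="post-334">...</div> '
    '<div id="post-x">...</div> <div id="repost-7">...</div>'
)

soup = BeautifulSoup(html, "html.parser")

# Anchored at both ends: "post-" followed by digits only.
pattern = re.compile(r"^post-(\d+)$")
loose = re.compile("post-")

strict_ids = [d["id"] for d in soup.find_all("div", id=pattern)]
loose_ids = [d["id"] for d in soup.find_all("div", id=loose)]

# Reuse the same pattern to extract the number from each matching id.
post_numbers = [int(pattern.match(i).group(1)) for i in strict_ids]

print(strict_ids)
print(loose_ids)
print(post_numbers)
```

BeautifulSoup applies a compiled pattern with search(), so without the ^ anchor an id such as "repost-7" matches too.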
