Manipulating HTML-Code in Python3 [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
I'm trying to manipulate an HTML file and remove a div with a certain id, using Python 3.
Is there a more elegant way to manipulate or remove this container than a mix of for loops and regex?
I know there is the HTMLParser module, but I'm not sure it will help me: it finds the corresponding tags, but how do I remove them and their contents?

Try lxml and css/xpath queries.
For example, with this html:
<html>
<body>
<p>Some text in a p.</p>
<div class="go-away">Some text in a div.</div>
<div><p>Some text in a p in a div</p></div>
</body>
</html>
You can read that in, remove the div with class "go-away", and output the result with:
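import lxml.html
# html_txt is assumed to hold the markup shown above as a string
html = lxml.html.fromstring(html_txt)
# cssselect() needs the separate cssselect package; html.xpath('//div[@class="go-away"]') avoids it
go_away = html.cssselect('.go-away')[0]
go_away.getparent().remove(go_away)
lxml.html.tostring(html) # Returns bytes; use lxml.html.tostring(html).decode("utf-8") to get a string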
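The question asks about an id rather than a class. If lxml is not available and the markup happens to be well-formed (an assumption that much real-world HTML will not meet), a rough standard-library sketch with xml.etree.ElementTree could look like the following. The id value "go-away" is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is well-formed, so an XML parser accepts it.
html_txt = """<html>
<body>
<p>Some text in a p.</p>
<div id="go-away">Some text in a div.</div>
<div><p>Some text in a p in a div</p></div>
</body>
</html>"""

root = ET.fromstring(html_txt)

# ElementTree has no getparent(), so visit every element as a potential
# parent and remove matching children. list() copies guard against
# modifying the tree while iterating over it.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "div" and child.get("id") == "go-away":
            parent.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```

With lxml itself, the same lookup by id can be written as an XPath query such as html.xpath('//div[@id="go-away"]').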

While I can't stress this enough
DON'T PARSE HTML WITH REGEX!!
here's how I'd do it with regex.
from re import sub, DOTALL
# DOTALL lets .*? span line breaks; the match still stops at the first </div>, so nested divs break it
new_html = sub(r'<div class=(\'go-away\'|"go-away")>.*?</div>', '', html, flags=DOTALL)
Even though I think that should be OK for simple input, you should never use regex to parse HTML. More often than not it creates odd, hard-to-debug issues, and it'll create more work for you than you started with. Don't parse with regex.
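To make the "odd, hard-to-debug issues" concrete, here is a small sketch of one such failure. The nested markup is invented for the demonstration; the pattern is a simplified, double-quote-only version of the one above.

```python
import re

# Invented input: the div we want to remove contains another div.
html = '<div class="go-away"><div>inner</div> tail</div><p>keep</p>'

# The non-greedy .*? stops at the FIRST closing </div>, which belongs to
# the inner div, so only part of the container is removed.
result = re.sub(r'<div class="go-away">.*?</div>', '', html, flags=re.DOTALL)

print(result)  # ' tail</div><p>keep</p>' (a stray closing tag is left behind)
```

A parser tracks nesting, which is why the lxml approach does not have this problem.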

Related

Python re.search assistance for Django [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed last year.
This post was edited and submitted for review 11 months ago and failed to reopen the post:
Original close reason(s) were not resolved
I need to replace part of an HTML string via a Django custom template tag that I am customising.
The string in its raw format is as follows:
arg = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}'>Forgot your password?</a>"
I need to replace the {% url 'account:pwdreset'%} part with a url string using re.search().
The code that I have written is clumsy, and I would appreciate help finding a better way to achieve the same result.
url_string = re.search("{.*?}", arg)
url_string_inner = re.search("'(.+?)'", url_string.group())
add_html = SafeText(''.join([arg.split('{')[0], reverse(url_string_inner.group(1)), arg.split('}')[1]]))
!!UPDATE!!
The solution that I ran with is as follows:
url_string = re.search("{.*?}", arg)
url_string_inner = re.search("'(.+?)'", url_string.group())
add_html = SafeText(''.join([arg.split('{')[0], reverse(url_string_inner.group(1)), arg.split('}')[1]]))
Thank you Fourth Bird for your help.
If you only want to replace the part containing 'account:pwdreset', you could use re.sub with a capture group and reference that group in the replacement, between single quotes:
'{%\s*url '([^']*)'%}
import re
pattern = r"'{%\s*url '([^']*)'%}"
s = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}>Forgot your password?</a>"
print(re.sub(pattern, r"'\1'", s))
Output
<a class='login-password-forget' tabindex='1' href='account:pwdreset'>Forgot your password?</a>
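The substitution above leaves the URL name in place rather than the resolved URL. One way to combine the match and the lookup is to pass a function as the replacement to re.sub. This is only a sketch: fake_reverse and the URLS table are hypothetical stand-ins for django.urls.reverse so that the snippet runs without Django, and the path in the table is invented.

```python
import re

# Hypothetical stand-in for django.urls.reverse; the path is made up.
URLS = {"account:pwdreset": "/account/password-reset/"}


def fake_reverse(name):
    return URLS[name]


# Matches {% url 'name' %} with optional whitespace; group 1 is the URL name.
URL_TAG = re.compile(r"\{%\s*url\s+'([^']*)'\s*%\}")


def expand_url_tags(text):
    # The callable receives each match and returns its replacement text.
    return URL_TAG.sub(lambda m: fake_reverse(m.group(1)), text)


arg = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}'>Forgot your password?</a>"
expanded = expand_url_tags(arg)
print(expanded)
```

In a real template tag, fake_reverse would be swapped for reverse, and the result would still need to be marked safe as in the question's code.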

Replace string between two delimiters in html [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
How can I replace some string located between the delimiters href="" ?
<td>https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</td>
</tr>
I want to replace this:
href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n"
with this:
href="LINK"
For a quick and dirty way, you could use re.sub() to match the link text that follows the href attribute and replace it with your own:
import re
html = """<td>https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</td>
</tr>"""
re.sub(r'">.*</a>', '">LINK</a>', html)
Output:
'<td>https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</td>\n </tr>'
But remember that parsing HTML with regular expressions is not recommended, as it can have many edge cases. I would only use this for a quick and dirty way when I absolutely know how my input HTML is structured. For a more professional approach, you should look into HTML parsers (e.g. 'beautifulsoup').
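The anchor markup did not survive in the snippet above, so the following sketch assumes a plausible shape for it: an a element whose href holds the form URL. The link text "Open form" is invented. Under that assumption, replacing only the href value (which is what the question asks for) might look like this:

```python
import re

# Assumed input: the original anchor tag is not shown in the question,
# so this markup is a guess at its general shape.
html = '<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n">Open form</a></td>'

# [^"]* matches everything up to the closing double quote of the attribute,
# so only the attribute value is replaced and the link text is untouched.
new_html = re.sub(r'href="[^"]*"', 'href="LINK"', html)

print(new_html)  # <td><a href="LINK">Open form</a></td>
```

This replaces every double-quoted href in the string, so the same caveat applies: it is only reasonable when the structure of the input is known.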

Search and create list from a string Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am very new to Python and I am trying to create a list out of a string in Python.
Input = '<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>'
Desired Output = [File1.pdf, File2.ppt, File3.docx]
What is the most efficient and pythonic way to achieve this? Any help will be very much appreciated.
Thanks
You can use BeautifulSoup, which has HTML parsing utilities.
>>> from bs4 import BeautifulSoup
>>> html = """<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> files_list = [i.text.split('file: ')[1].replace(')', '') for i in soup.find_all('i')]
>>> print(files_list)
['File1.pdf', 'File2.ppt', 'File3.docx']
There might be a nice way to do this using an HTML parser, as shree.pat18 suggested, but here is a quick and dirty way using str.split():
Output = [s.split(")")[0] for s in Input.split("file: ")[1:]]
By first splitting on "file: " we get list of strings, the first one contains the first part of the original string so we don't care about that one. The others start with the filenames that we want and the first character we don't care about is ")". So split on ")" and take the first part.
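For completeness, a parser-based variant that needs no third-party package can be sketched with the standard library's html.parser. The class name ItalicTextCollector is made up for this example, and the fixed "(See attached file: " prefix is assumed from the input shown in the question.

```python
from html.parser import HTMLParser


class ItalicTextCollector(HTMLParser):
    """Collects the text content of every <i> element."""

    def __init__(self):
        super().__init__()
        self.in_i = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.in_i = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_i = False

    def handle_data(self, data):
        # Only keep text that appears inside an <i> element.
        if self.in_i:
            self.texts[-1] += data


html = ('<html><body><ul style="padding-left: 5pt">'
        '<i>(See attached file: File1.pdf)</i>'
        '<i>(See attached file: File2.ppt)</i>'
        '<i>(See attached file: File3.docx)</i>'
        '</ul></body></html>')

collector = ItalicTextCollector()
collector.feed(html)
collector.close()

prefix = "(See attached file: "
# Strip the fixed prefix and the trailing ")" from each collected text.
files = [t[len(prefix):-1] for t in collector.texts
         if t.startswith(prefix) and t.endswith(")")]
print(files)  # ['File1.pdf', 'File2.ppt', 'File3.docx']
```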

How can I use regex in Python to read between HTML tags from the bottom of the file? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have an HTML response and need to get the data between the last <title> tags on the page. Is there a way to do this with regex in Python, or with another Python tool?
eg.
<title>abc
</title>
<title>def
</title>
Should return def.
You shouldn't use regex to parse HTML, as it is usually inefficient and hard to read. Regex should be a last resort when you have no other option.
Thankfully there are plenty of HTML parsers for Python like BeautifulSoup.
With BeautifulSoup you can get the last title tag with this:
last_title = soup.find_all('title')[-1].text.replace('\n', '')
Use <title>\s*([\s\S]+?)\s*</title> as your regex (strips away leading and trailing whitespace from the title) with findall and take the last occurrence:
import re
text = """abc
<title>abc
</title>
def
ghi
<title>def
</title>
jkl
"""
tags = re.findall(r'<title>\s*([\s\S]+?)\s*</title>', text)
print(tags[-1]) # the last one
Prints:
def

Web Scraping - How to get a specific part of a weblink [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have the following link:
https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk
I have multiple links in a dataset, and each link follows the same pattern. I want to extract a specific part of each link: the text starting at the second http and ending before the first + sign.
I don't know how to do this using regex. I am working in Python. Kindly help me out.
If each link has the same pattern, you do not need regex. You can use str.find() and slicing:
link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# This finds the second occurrence of "https://" and returns the position
second_https = link.find("https://", link.find("https://")+1)
# Index of the end of the link
end_of_link = link.find("+")
new_link = link[second_https:end_of_link]
print(new_link)
This will return "https://cooking.nytimes.com/learn-to-cook" and will work if the link follows the same pattern as described (it is the second https:// in the link and ends with + sign)
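If some links in the dataset might not follow the pattern, note that str.find() returns -1 when nothing is found, and slicing with -1 silently gives a wrong result. A defensive variant could look like the sketch below; the function name extract_embedded_link and its parameters are made up for illustration.

```python
def extract_embedded_link(link, scheme="https://", terminator="+"):
    """Return the text from the second `scheme` up to the next `terminator`,
    or None if the link does not follow that pattern."""
    first = link.find(scheme)
    if first == -1:
        return None
    start = link.find(scheme, first + 1)
    if start == -1:
        return None
    # Search for the terminator only after the embedded link starts.
    end = link.find(terminator, start)
    if end == -1:
        return None
    return link[start:end]


link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
print(extract_embedded_link(link))                    # https://cooking.nytimes.com/learn-to-cook
print(extract_embedded_link("https://example.com/"))  # None
```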
I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
result = re.findall('https?.*', parsed.query)[0].split('+')[0]
print(result)
Output:
https://cooking.nytimes.com/learn-to-cook
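The regex step can also be dropped by letting urllib.parse decode the query string. This sketch assumes every link carries the cached address in the q parameter in the form cache:<id>:<url>, as the example does.

```python
from urllib.parse import urlparse, parse_qs

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"

# parse_qs maps each query parameter to a list of decoded values;
# a "+" in the query is decoded to a space.
query = parse_qs(urlparse(url_example).query)
cached = query["q"][0]

# Split off the "cache" and id fields, keep the rest, and drop the
# trailing space that came from the "+".
embedded = cached.split(":", 2)[2].strip()
print(embedded)  # https://cooking.nytimes.com/learn-to-cook
```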
