Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
How can I replace some string located between the delimiters href="" ?
<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>
</tr>
I want to replace this:
href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n"
with this:
href="LINK"
For a quick and dirty way, you could use re.sub() to match the href attribute and replace its value with your own:
import re
html = """<td><a href="https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>
</tr>"""
print(re.sub(r'href="[^"]*"', 'href="LINK"', html))
Output:
<td><a href="LINK">https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n</a></td>
</tr>
But remember that parsing HTML with regular expressions is not recommended, because there are many edge cases. I would only use this as a quick fix when I know exactly how my input HTML is structured. For a more robust approach, you should look into HTML parsers (e.g. BeautifulSoup).
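Here is a rough sketch of the parser route. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names HrefRewriter and replace_hrefs are made up for this example. Treat it as a starting point only: it drops comments and doctype declarations, and it normalizes attribute quoting and tag-name case.

```python
import html
from html.parser import HTMLParser


class HrefRewriter(HTMLParser):
    """Re-emit the markup, swapping the href of every <a> tag (hypothetical helper)."""

    def __init__(self, new_href):
        # convert_charrefs=False keeps entities such as &amp; as they were written
        super().__init__(convert_charrefs=False)
        self.new_href = new_href
        self.out = []

    def _tag(self, tag, attrs, close=">"):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href":
                value = self.new_href
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(name)
            else:
                parts.append(f'{name}="{html.escape(value, quote=True)}"')
        return "<" + " ".join(parts) + close

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, close=" />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def replace_hrefs(markup, new_href):
    parser = HrefRewriter(new_href)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


url = "https://forms.office.com/Pages/ResponsePage.aspx?id=uI1n"
snippet = f'<td><a href="{url}">{url}</a></td>'
result = replace_hrefs(snippet, "LINK")
print(result)
```

Unlike a pattern that only looks for href="...", this also handles single-quoted and unquoted attribute values, because the parser does the tokenizing.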
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have a HTML response, and I need to get the data between the last <title> tags on the page, is there a way I can do this with regex in Python or use another tool in Python?
eg.
<title>abc
</title>
<title>def
</title>
Should return def.
You shouldn't use regex to parse HTML, as most of the time it is inefficient and hard to read. Regex should be a last resort if you don't have any other options.
Thankfully there are plenty of HTML parsers for Python like BeautifulSoup.
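With BeautifulSoup, assuming html holds the response text, you can get the last title tag with this:
from bs4 import BeautifulSoup
last_title = BeautifulSoup(html, 'html.parser').find_all('title')[-1].text.replace('\n', '')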
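If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser. The class name TitleCollector and the variable page are made up for this example, and page just reuses the two titles from the question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element, in document order."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")  # start a new title entry

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data  # text may arrive in several chunks


page = "<title>abc\n</title>\n<title>def\n</title>"
collector = TitleCollector()
collector.feed(page)
collector.close()
last_title = collector.titles[-1].strip()
print(last_title)  # def
```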
Use <title>\s*([\s\S]+?)\s*</title> as your regex (strips away leading and trailing whitespace from the title) with findall and take the last occurrence:
import re
text = """abc
<title>abc
</title>
def
ghi
<title>def
</title>
jkl
"""
tags = re.findall(r'<title>\s*([\s\S]+?)\s*</title>', text)
print(tags[-1]) # the last one
Prints:
def
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I want to remove links in the format Reddit uses
comment = "Hello this is my [website](https://www.google.com)"
no_links = RemoveLinks(comment)
# no_links == "Hello this is my website"
I found a similar question about the same thing, but I don't know how to translate it to python.
I am not that familiar with regex so I would appreciate it if you explained what's happening.
You could do the following:
import re
# raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r'\[(.*?)\]\(.*?\)')
comment = "Hello this is my [website](https://www.google.com)"
print(pattern.sub(r'\1', comment))
The line:
pattern = re.compile(r'\[(.*?)\]\(.*?\)')
creates a regex pattern that searches for anything surrounded by square brackets, followed by anything surrounded by parentheses. The '?' after each '.*' makes it match as little text as possible (non-greedy).
The call sub(r'\1', comment) replaces each match with the first capturing group, in this case the text inside the brackets.
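To match the RemoveLinks(comment) call from the question, the pattern can be wrapped in a small function. The name remove_links is just a Python-style spelling chosen for this sketch.

```python
import re

# [text](url) -> text; both parts are matched non-greedily
LINK_PATTERN = re.compile(r'\[(.*?)\]\(.*?\)')


def remove_links(comment):
    """Replace every Reddit-style link with its visible text."""
    return LINK_PATTERN.sub(r'\1', comment)


comment = "Hello this is my [website](https://www.google.com)"
no_links = remove_links(comment)
print(no_links)  # Hello this is my website
```

Note that a URL that itself contains a closing parenthesis would be cut short, since the pattern stops at the first ')'.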
For more information about regex I suggest you read this.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Hi I would like to get everything after the '_' using regex
For example: I have --> I want
'aaa_bbb_ccc' --> 'bbb_ccc'
'dd_aaaa_1' --> 'aaaa_1'
'*/_2d*_//' --> '2d*_//'
Is there any way to do it?
Thanks in advance.
I rather like the split suggestion given by @Maroun in a comment above. Here is an option using re.sub:
import re
x = "aaa_bbb_ccc"
output = re.sub(r'^[^_]+_', '', x)
print(output)  # bbb_ccc
The regex needs little explanation: it removes everything up to and including the first underscore in the input string.
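For comparison, the split suggestion mentioned above could look roughly like this. The function name after_first_underscore is made up for the sketch, and returning the input unchanged when there is no underscore is my assumption about the desired behaviour.

```python
def after_first_underscore(text):
    """Return everything after the first '_' (the text itself if there is none)."""
    parts = text.split("_", 1)  # maxsplit=1: split on the first underscore only
    return parts[1] if len(parts) == 2 else text


for s in ("aaa_bbb_ccc", "dd_aaaa_1", "*/_2d*_//"):
    print(s, "-->", after_first_underscore(s))
```

One difference from the re.sub version: for a string that starts with an underscore, such as '_abc', split returns 'abc', while the regex leaves the string untouched because [^_]+ needs at least one character before the underscore.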
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm trying to manipulate a HTML-File and remove a div with a certain id-Tag, using Python3.
Is there a more elegant way to manipulate or remove this container than a mix of for-Loops and regex?
I know, there is the HTMLParser module, but I'm not sure if this will help me (it finds the corresponding tags, but how to remove those and the contents?).
Try lxml and css/xpath queries.
For example, with this html:
<html>
<body>
<p>Some text in a p.</p>
<div class="go-away">Some text in a div.</div>
<div><p>Some text in a p in a div</p></div>
</body>
</html>
You can read that in, remove the div with class "go-away", and output the result with:
import lxml.html
# html_txt holds the markup shown above; cssselect() needs the cssselect package installed
html = lxml.html.fromstring(html_txt)
go_away = html.cssselect('.go-away')[0] # Or with suitable xpath
go_away.getparent().remove(go_away)
lxml.html.tostring(html) # Or lxml.html.tostring(html).decode("utf-8") to get a string
While I can't stress this enough
DON'T PARSE HTML WITH REGEX!!
here's how I'd do it with regex.
from re import sub
new_html = sub('<div class=(\'go-away\'|"go-away")>.*?</div>', '', html)
Even though I think that should be OK here, you should never use regex to parse HTML. More often than not it creates odd, hard-to-debug issues, and it will leave you with more work than you started with. Don't parse HTML with regex.
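To make the "odd, hard-to-debug issues" concrete, here is a small demonstration of the pattern above on two inputs. The strings flat and nested are invented for the example. The pattern handles the flat case, but on a nested div it stops at the first closing tag and leaves broken markup behind.

```python
import re

# Same pattern as above, written as a raw triple-quoted string to avoid escaping quotes
pattern = re.compile(r'''<div class=('go-away'|"go-away")>.*?</div>''')

flat = '<p>keep</p><div class="go-away">Some text in a div.</div>'
nested = '<div class="go-away"><div>inner</div> tail</div><p>keep</p>'

print(pattern.sub('', flat))    # <p>keep</p>
print(pattern.sub('', nested))  #  tail</div><p>keep</p>
```

The pattern also misses a div whose content spans several lines, because '.' does not match a newline unless re.DOTALL is passed.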
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
For some reason, when I use a regex to get the number I need, it returns None.
But when I run it here http://regexr.com/38n3o it works.
The regex was designed to get the last number of the IP so it can be removed.
lanip=74.125.224.72
notorm=re.search("/([1-9])\w+$/g", lanip)
That is not how you define a regular expression in Python. The correct way would be:
import re
lanip = "74.125.224.72"
notorm = re.search(r"([1-9])\w+$", lanip)
print(notorm)
Output:
<re.Match object; span=(11, 13), match='72'>
You were using JavaScript regex literal syntax. In Python the pattern is a plain string, without the surrounding slashes or the trailing g flag. To read more on the correct Python syntax, read the documentation of the re module.
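A quick way to see the difference is to run both spellings side by side. The variable names js_style and py_style are only for this sketch.

```python
import re

lanip = "74.125.224.72"

# In Python the slashes and the trailing "g" are ordinary characters to match,
# so this pattern can never match the address.
js_style = re.search(r"/([1-9])\w+$/g", lanip)

# The same expression without the JavaScript delimiters.
py_style = re.search(r"([1-9])\w+$", lanip)

print(js_style)           # None
print(py_style.group(0))  # 72
```

Where JavaScript uses the g flag to find every match, Python offers re.findall() or re.finditer(). Other flags are passed as arguments such as re.IGNORECASE.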
If you want to match the last number of an IP use:
import re
lanip = "74.125.224.72"
# one octet, 0-255; the full pattern is four of these joined by literal dots
octet = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
notorm = re.search(r"\.".join([octet] * 4), lanip)
print(notorm.group(4))
Output:
72
Regex used from http://www.regular-expressions.info/examples.html
Your example did work in this scenario, but would match a lot of false positives.
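Since the stated goal was to get the last number so it can be removed, here is a minimal sketch of both steps. It assumes lanip is already a well-formed dotted IPv4 string and does no validation. The variable names other than lanip are made up.

```python
import re

lanip = "74.125.224.72"

# Regex route: capture the trailing digits, then strip ".<digits>" from the end
last_octet = re.search(r"(\d+)$", lanip).group(1)
prefix = re.sub(r"\.\d+$", "", lanip)

# Plain string route: split once, from the right
prefix_split, last_split = lanip.rsplit(".", 1)

print(last_octet, prefix)        # 72 74.125.224
print(last_split, prefix_split)  # 72 74.125.224
```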
What is lanip's type? As written, that can't run.
It needs to be a string, i.e.
lanip = "74.125.224.72"
Also, your RE syntax looks strange; make sure you've read the documentation on Python's RE syntax.