I have a string with tagged elements inside. I want to remove the tags and add some characters to the content inside the tags.
s = 'Hello there <something>, this is more text <tagged content>'
result = 'Hello there somethingADDED, this is more text tagged contentADDED'
So far, I've tried
import re
result = re.search('\<(.*)\>', s)
result = result.group(1)
and s = s.split('>') and regex each substring one by one, but it doesn't seem like the correct or efficient way of doing this.
Use re.sub with the back-reference \1, which inserts the text captured by the first group.
import re
x = "Hello there <something>, this is more text <tagged content>"
print(re.sub(r"<([^>]*)>", r"\1added", x))
Output: Hello there somethingadded, this is more text tagged contentadded
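As a minimal Python 3 sketch of the same idea, the snippet below reproduces the exact output asked for in the question. The helper name add_suffix and its suffix parameter are made up for illustration; a replacement function is used instead of a \1 template so that a suffix containing backslashes would be inserted literally.

```python
import re


def add_suffix(text, suffix="ADDED"):
    """Strip <...> tags and append `suffix` to the content of each tag."""
    # [^>]* matches everything up to the closing '>', so each tag is handled separately.
    return re.sub(r"<([^>]*)>", lambda m: m.group(1) + suffix, text)


s = "Hello there <something>, this is more text <tagged content>"
print(add_suffix(s))
# Hello there somethingADDED, this is more text tagged contentADDED
```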
Related
<p>I'd like to find the string between the two paragraph tags.</p><br><p>And also this string</p>
How would I get the string between the first two paragraph tags? And then, how would I get the string between the 2nd paragraph tags?
Regular expressions
import re
matches = re.findall(r'<p>.+?</p>',string)
Here is your text run in the console.
>>> import re
>>> string = """<p>I'd like to find the string between the two paragraph tags.</p><br><p>And also this string</p>"""
>>> re.findall('<p>.+?</p>', string)
["<p>I'd like to find the string between the two paragraph tags.</p>", '<p>And also this string</p>']
If you want the string between the p tags (excluding the tags themselves), put parentheses around .+? in the findall call:
import re
string = """<p>I'd like to find the string between the two paragraph tags.</p><br><p>And also this string</p>"""
subStr = re.findall(r'<p>(.+?)</p>',string)
print(subStr)
Result
["I'd like to find the string between the two paragraph tags.", 'And also this string']
In between <p> and </p>
In [7]: content = "<p>I'd like to find the string between the two paragraph tags.</p><br><p>And also this string</p>"
In [8]: re.findall(r'<p>(.+?)</p>', content)
Out[8]:
["I'd like to find the string between the two paragraph tags.",
'And also this string']
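The answers above rely on the non-greedy quantifier .+?. A small sketch of why it matters, using a shortened stand-in string (not the asker's text): the greedy .+ runs to the last </p>, while .+? stops at the first one.

```python
import re

content = "<p>first</p><br><p>second</p>"

# Greedy: .+ consumes as much as possible, so it swallows the middle tags.
greedy = re.findall(r"<p>(.+)</p>", content)
print(greedy)  # ['first</p><br><p>second']

# Non-greedy: .+? stops at the nearest closing tag, giving one match per paragraph.
lazy = re.findall(r"<p>(.+?)</p>", content)
print(lazy)  # ['first', 'second']

# The first and second paragraphs are then plain list items.
first, second = lazy
```

By default . does not match a newline, so paragraphs that span several lines would additionally need the re.DOTALL flag.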
In Python, I'm trying to extract all time ranges (of the form HHmmss-HHmmss) from a string. I'm using this code:
text = "random text 0700-1300 random text 1830-2230 random 1231 text"
regex = "(.*(\d{4,10}-\d{4,10}))*.*"
match = re.search(regex, text)
This only returns 1830-2230, but I'd like to get both 0700-1300 and 1830-2230. In my application there may be zero or any number of time ranges (within reason) in the text string. I'd appreciate any hints.
Try using re.findall to find all matches:
import re
text = "random text 0700-1300 random text 1830-2230 random 1231 text"
pat = r"\b\d{4,10}-\d{4,10}\b"
for m in re.findall(pat, text):
    print(m)
Prints:
0700-1300
1830-2230
Try this regex:
(\d{4}-\d{4})
All you need is the pattern {four digits}-{four digits}. The other parts are not needed.
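To connect the question with the answers, here is a short sketch showing why the original pattern yields only one range, and how re.findall behaves with capturing groups. The split into start and end groups is an illustrative assumption, not something the asker requested.

```python
import re

text = "random text 0700-1300 random text 1830-2230 random 1231 text"

# A group inside a repetition keeps only the text from its last iteration,
# so re.search with the original pattern exposes a single range.
m = re.search(r"(.*(\d{4,10}-\d{4,10}))*.*", text)
print(m.group(2))  # 1830-2230

# With two capturing groups, re.findall returns one (start, end) tuple per match.
pairs = re.findall(r"\b(\d{4})-(\d{4})\b", text)
print(pairs)  # [('0700', '1300'), ('1830', '2230')]

# When there are no time ranges, the result is simply an empty list.
none_found = re.findall(r"\b(\d{4})-(\d{4})\b", "no ranges here")
print(none_found)  # []
```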
How can I identify a certain pattern in a string in Python in order to remove it? I want to clean my string at every occurrence of {$arbitrary}, where arbitrary can be any name.
my_string = "hello. This is my example of {$test} {$foo} {$string} my string"
my_string = my_string.replace("{$arbitrary}", "")
Output I want:
hello. This is my example of my string
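str.replace only removes the literal text "{$arbitrary}", so the attempt above leaves the string unchanged. One possible approach, sketched here under the assumption that a placeholder never contains a closing brace, is re.sub with a pattern for "{$", any name, and "}":

```python
import re

my_string = "hello. This is my example of {$test} {$foo} {$string} my string"

# \{\$   literal "{$"
# [^}]*  the placeholder name (anything except a closing brace)
# \}     literal "}"
# \s*    whitespace after the placeholder, so no double spaces remain
cleaned = re.sub(r"\{\$[^}]*\}\s*", "", my_string)
print(cleaned)  # hello. This is my example of my string
```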
So I get a file from client like this (4 lines are displayed below)
Some text #instagram_h1 #instagram_h2 some more text #instagram_h3 more texts
Some text #instagram_h3 #instagram_h2 some more text #instagram_h1 more texts
Some text #instagram_h2 some more text #instagram_h3 more texts
Some text some more text #instagram_h3 more texts
I want to keep only the lines that contain #instagram_h3 and discard the lines that contain either or both of #instagram_h1 and #instagram_h2. #instagram_h3 will always be present.
My attempt:
h1 = '#instagram_h1'
h2 = '#instagram_h2'
h3 = '#instagram_h3'
result = re.search(r"(!h1|!h2)", str)
print(result)
Here, result is always None. Can anyone please explain what I am doing wrong?
There is no regex ! operator. What you can do instead is find the lines that do contain those strings, and then exclude them.
if re.search(r"#instagram_(h1|h2)\b", str):
    pass  # no good!
Notice how I've added \b to prevent something like #instagram_h123 from matching.
Alternatively, for a simple search like this you could skip regexes and check for the substrings directly.
if '#instagram_h1' in str or '#instagram_h2' in str:
    pass  # no good!
# or
hashtags = ['#instagram_h1', '#instagram_h2']
if any(hashtag in str for hashtag in hashtags):
    pass  # sorry!
Note that these simple substring tests will also match #instagram_h123 or #instagram_h234, which may not be what you want.
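Putting the pieces together, a sketch of the whole filter applied to the four sample lines from the question might look like this. The variable names wanted, unwanted, and kept are illustrative; in practice the lines would come from the client's file rather than a list.

```python
import re

lines = [
    "Some text #instagram_h1 #instagram_h2 some more text #instagram_h3 more texts",
    "Some text #instagram_h3 #instagram_h2 some more text #instagram_h1 more texts",
    "Some text #instagram_h2 some more text #instagram_h3 more texts",
    "Some text some more text #instagram_h3 more texts",
]

# \b stops the patterns from matching longer tags such as #instagram_h123.
unwanted = re.compile(r"#instagram_h[12]\b")
wanted = re.compile(r"#instagram_h3\b")

# Keep a line only if it has h3 and neither h1 nor h2.
kept = [line for line in lines if wanted.search(line) and not unwanted.search(line)]
print(kept)  # ['Some text some more text #instagram_h3 more texts']
```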
Is there a way to do substitution on a group?
Say I am trying to insert a link into text, based on custom formatting. So, given something like this:
This is a random text. This should be a [[link somewhere]]. And some more text at the end.
I want to end up with
This is a random text. This should be a <a href="link_somewhere">link somewhere</a>. And some more text at the end.
I know that '\[\[(.*?)\]\]' will match stuff within square brackets as group 1, but then I want to do another substitution on group 1, so that I can replace space with _.
Is that doable in a single re.sub regex expression?
You can pass a function as the replacement instead of a string.
>>> import re
>>> def as_link(match):
...     link = match.group(1)
...     return '<a href="{}">{}</a>'.format(link.replace(' ', '_'), link)
...
>>> text = 'This is a random text. This should be a [[link somewhere]]. And some more text at the end.'
>>> re.sub(r'\[\[(.*?)\]\]', as_link, text)
'This is a random text. This should be a <a href="link_somewhere">link somewhere</a>. And some more text at the end.'
You could do something like this.
import re
pattern = re.compile(r'\[\[([^]]+)\]\]')
def convert(text):
    def replace(match):
        link = match.group(1)
        return '<a href="{}">{}</a>'.format(link.replace(' ', '_'), link)
    return pattern.sub(replace, text)
s = 'This is a random text. This should be a [[link somewhere]]. .....'
convert(s)
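For a string with several links, the same function-as-replacement technique applies to each match in turn. The sketch below assumes the desired output is an HTML anchor whose href is the link text with spaces replaced by underscores; the names LINK and linkify are made up for illustration.

```python
import re

# Non-greedy (.*?) so that each [[...]] pair is matched on its own.
LINK = re.compile(r"\[\[(.*?)\]\]")


def linkify(text):
    """Replace every [[link text]] with an anchor pointing at link_text."""
    def to_anchor(match):
        label = match.group(1)  # text between the double brackets
        return '<a href="{}">{}</a>'.format(label.replace(" ", "_"), label)

    # re calls to_anchor once per match and inserts its return value.
    return LINK.sub(to_anchor, text)


print(linkify("See [[first page]] and [[second page]]."))
# See <a href="first_page">first page</a> and <a href="second_page">second page</a>.
```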