How to identify a pattern with arbitrary content in Python and remove it - python

How can I identify a certain pattern in a Python string and remove it? I want to clean my string by removing every occurrence of {$arbitrary}, where the part after $ can be anything:
my_string = "hello. This is my example of {$test} {$foo} {$string} my string"
my_string = my_string.replace("{$arbitrary}", "")
Output I want:
hello. This is my example of my string
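One way to get that output is a regular expression, since str.replace only matches the literal text {$arbitrary}. The sketch below assumes a placeholder is {$ followed by anything up to the next }, and that the whitespace right after each placeholder should be removed too; both are assumptions about the intended format.

```python
import re

my_string = "hello. This is my example of {$test} {$foo} {$string} my string"

# \{\$   literal "{$"
# [^}]*  any run of characters that are not "}"
# \}     the closing brace
# \s*    whitespace following the placeholder (assumed unwanted)
cleaned = re.sub(r"\{\$[^}]*\}\s*", "", my_string)
print(cleaned)
```

Dropping the trailing \s* keeps the original spacing, which would leave several consecutive spaces where the placeholders used to be.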

Related

How to find specific pattern in a paragraph in Python?

I want to find a specific pattern in a paragraph. Each match must contain both letters (a-zA-Z) and digits (0-9) and be at least 5 characters long. How can I implement this in Python?
My code is:
str = "I love5 verye mu765ch"
print(re.findall('(?=.*[0-9])(?=.*[a-zA-Z]{5,})',str))
This does not return the words I want (only empty matches).
The expected result is:
love5
mu765ch
Valid matches look like:
9aacbe
aver23893dk
asdf897
This is easily done with some programming logic and a simple regex:
import re
string = "I love5 verye mu765ch a123...bbb"
pattern = re.compile(r'(?=\D*\d)(?=[^a-zA-Z]*[a-zA-Z]).{5,}')
interesting = [word for word in string.split() if pattern.match(word)]
print(interesting)
This yields
['love5', 'mu765ch', 'a123...bbb']
See a demo on ideone.com.
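The pattern above accepts any characters once a digit and a letter are present, which is why a123...bbb is in the result. If only letters and digits should be allowed, as the examples in the question suggest, a stricter variant might look like the sketch below. The name strict_pattern is made up for this illustration.

```python
import re

string = "I love5 verye mu765ch a123...bbb"

# (?=[a-zA-Z]*\d)    a digit must follow any leading letters
# (?=\d*[a-zA-Z])    a letter must follow any leading digits
# [a-zA-Z0-9]{5,}    five or more characters, letters and digits only
strict_pattern = re.compile(r"(?=[a-zA-Z]*\d)(?=\d*[a-zA-Z])[a-zA-Z0-9]{5,}")

# fullmatch rejects words with anything left over after the pattern
strict = [word for word in string.split() if strict_pattern.fullmatch(word)]
print(strict)
```

Using fullmatch instead of match is what excludes a123...bbb, because the dots cannot be consumed by the character class.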

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, the operation has not been applied and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines, as you noticed. With your second one you were closer, but you left out the * after the . that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in a Python regex matches everything except newlines by default (unless re.DOTALL is set). For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add a .strip() if you want to remove the remaining leading whitespace.
To keep the text up to that symbol instead, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
Alternatively, split on */ and rejoin everything after the first occurrence:
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
which gives the output:
' extra text. \n'
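Another regex-free option is str.partition, which splits at the first occurrence of a separator and reports whether the separator was found. This is a sketch rather than any of the answers above; the helper name text_after is made up for illustration.

```python
def text_after(text, marker="*/"):
    """Return the part of text after the first marker, stripped of whitespace.

    If the marker is missing, the text is returned unchanged (an assumed
    fallback; raising an error may suit other cases better).
    """
    _before, found, after = text.partition(marker)
    return after.strip() if found else text


string = ''' something
other things
etc. */ extra text.
'''
print(text_after(string))
```

Unlike string.index, partition does not raise when the marker is absent, so the missing-marker case can be handled explicitly.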

How to extract just the characters "abc-3456" from the given text in python

I have this code:
import re
text = "this is my desc abc-3456"
m = re.findall(r"\w+-\d+", text)
print(m)
This prints ['abc-3456'], but I want to get only abc-3456 (without the square brackets and the quotes).
How can I do this?
import re
text = "this is my desc abc-3456"
m = re.findall(r"\w+-\d+", text)
print(m[0])
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
findall returns a list of strings. If you want the first one, use m[0].
print(m[0]) prints the string without the [] and ''.
If you only want the first (or only) result, do this:
import re
text = "this is my desc abc-3456"
m = re.search(r"\w+-\d+", text)
print(m.group())
re.findall returns a list of matches, and each item in that list is a string. You can use re.finditer if you want match objects instead.
In Python, a list's representation is in brackets: [member1, member2, ...].
A string ("somestring") representation is in quotes: 'somestring'.
This means the representation of a list of strings is:
['somestring1', 'somestring2', ...]
So you have a string in a list; the characters you want to remove are part of Python's representation of the list, not part of your data.
To get the string simply take the first element from the list:
mystring = m[0]
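Both m[0] and m.group() fail when the text contains no match: m[0] raises IndexError on an empty list, and re.search returns None. A defensive sketch, with first_code as a hypothetical helper name:

```python
import re


def first_code(text):
    """Return the first 'word-digits' token in text, or None if absent."""
    match = re.search(r"\w+-\d+", text)
    return match.group() if match else None


print(first_code("this is my desc abc-3456"))
print(first_code("no code here"))
```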

Regex search and replace with modified result

I have a string with tagged elements inside. I want to remove the tags and add some characters to the content inside the tags.
s = 'Hello there <something>, this is more text <tagged content>'
result = 'Hello there somethingADDED, this is more text tagged contentADDED'
So far, I've tried:
import re
result = re.search('\<(.*)\>', s)
result = result.group(1)
I also tried s = s.split('>') and running a regex on each substring one by one, but that doesn't seem like a correct or efficient way of doing this.
Use the back-reference \1:
import re
x = "Hello there <something>, this is more text <tagged content>"
print(re.sub(r"<([^>]*)>", r"\1ADDED", x))
Output: Hello there somethingADDED, this is more text tagged contentADDED
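When the replacement needs more than fixed text, re.sub also accepts a function that receives each match object and returns the replacement string. The sketch below upper-cases the tag content before appending the suffix; the upper-casing is only an illustration, not something the question asks for.

```python
import re

s = "Hello there <something>, this is more text <tagged content>"


def expand(match):
    # group(1) is the text captured between "<" and ">"
    return match.group(1).upper() + "ADDED"


result = re.sub(r"<([^>]*)>", expand, s)
print(result)
```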

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and extract specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a substring of a string between two characters that mark the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    # continue searching after the end of this match
    inputString = inputString[mat.end():]
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
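If the format is fixed, you can also split on the quote character; the values end up at the odd indices:
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])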
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
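If the keys matter as well as the values, two capture groups can pull out both at once. A sketch, assuming keys are made of word characters and values never contain a double quote:

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# With two groups, findall returns a list of (key, value) tuples,
# which dict() turns into a mapping.
attributes = dict(re.findall(r'(\w+)="([^"]*)"', input_string))
print(attributes)
```

The values stay strings, so something like float(attributes["confidence"]) is still needed where a number is expected.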
