How to identfy pattern with arbitrary content python and remove it

How to identfy pattern with arbitrary content python and remove it - python

how can I identify a certain pattern of a string in python to remove it? What I want to do clean my string at every occurence of {$arbitrary}
my_string = "hello. This is my example of {$test} {$foo} {$string} my string"
my_string = my_string.replace("{$arbitrary}", "")
Output I want:
hello. This is my example of my string

Related

How to find specific pattern in a paragraph in Python?

I want to find a specific pattern in a paragraph. The pattern must contain a-zA-Z and 0-9 and length is 5 or more than 5. How to implement it on Python?
My code is:
str = "I love5 verye mu765ch"
print(re.findall('(?=.*[0-9])(?=.*[a-zA-Z]{5,})',str))
this will return a null.
Expected result like:
love5
mu765ch
the valid pattern is like:
9aacbe
aver23893dk
asdf897

This is easily done with some programming logic and a simple regex:
import re
string = "I love5 verye mu765ch a123...bbb"
pattern = re.compile(r'(?=\D*\d)(?=[^a-zA-Z]*[a-zA-Z]).{5,}')
interesting = [word for word in string.split() if pattern.match(word)]
print(interesting)
This yields
['love5', 'mu765ch', 'a123...bbb']
See a demo on ideone.com.

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it did not perform the operation I wanted and the whole string is printing.

I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()

The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)

Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub("[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.

to keep text until that symbol you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.

string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'

How to extract just the characters "abc-3456" from the given text in python

i have this code
import re
text = "this is my desc abc-3456"
m = re.findall("\w+\\-\d+", text)
print m
This prints ['abc-3456'] but i want to get only abc-3456 (without the square brackets and the quotes].
How to do this?

import re
text = "this is my desc abc-3456"
m = re.findall("\w+\\-\d+", text)
print m[0]

re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
findall returns list of strings. If you want the first one then use m[0].
print m[0] will give string without [] and ''.

If you only want the first (or only) result, do this:
import re
text = "this is my desc abc-3456"
m = re.search("\w+\\-\d+", text)
print m.group()

re.findall retuns a list of matches. In that list the result is a string. You can use re.finditer if you want.
In python, a list's representation is in brackets: [member1, member2, ...].
A string ("somestring") representation is in quotes: 'somestring'.
This means the representation of a list of strings is:
['somestring1', 'somestring2', ...]
So you have a string in a list, the characters you want to remove are a part of python's representation and not a part of the data you have.
To get the string simply take the first element from the list:
mystring = m[0]

regex search and replace with modified result

I have a string with tagged elements inside. I want to remove the tags and add some characters to the content inside the tags.
s = 'Hello there <something>, this is more text <tagged content>'
result = 'Hello there somethingADDED, this is more text tagged contentADDED
So far, I've tried
import re
result = re.search('\<(.*)\>', s)
result = result.group(1)
and s = s.split('>') and regex each substring one by one, but it doesn't seem like the correct or efficient way of doing this.

Use back-reference \1.
x="Hello there <something>, this is more text <tagged content>"
print re.sub(r"<([^>]*)>",r"\1added",x)
Output :Hello there somethingadded, this is more text tagged contentadded

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?

You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']

fields = inputString.split('"')
print fields[1], fields[3], fields[5]

You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to identfy pattern with arbitrary content python and remove it - python

Related

How to find specific pattern in a paragraph in Python?

How to remove text before a particular character or string in multi-line text?

How to extract just the characters "abc-3456" from the given text in python

regex search and replace with modified result

Breaking up substrings in Python based on characters

Categories

Resources