python regular expression also match special characters


Currently I use this simple script to search for a tag in a string:
import re
tag = "#tag"
text = "test string with #tag inserted"
match = re.search(tag, text, re.IGNORECASE)  # matches
Now suppose the text contains an a-acute:
tag = "#tag"
text = "test string with #tág inserted"
match = re.search(tag, text, re.IGNORECASE)  # does not match :(
How do I make this match work? It should work for other special characters too (é, è, í, etc.).
Thanks in advance!

You can normalize the text with unidecode (a third-party package, installed with pip install unidecode):
import re
from unidecode import unidecode
tag = "#tag"
text = u"test string with #tág inserted and a #tag"
text = unidecode(text)
re.findall(tag, text, re.IGNORECASE)
out:
['#tag', '#tag']
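If installing a third-party package is not an option, a similar effect can be sketched with the standard library's unicodedata module. This is only an approximation of what unidecode does: it decomposes accented letters and drops the combining marks, so it handles é, è, í and similar, but not characters that have no decomposition (for example ø or ß). The helper name strip_accents is made up for this illustration.

```python
import re
import unicodedata


def strip_accents(s):
    # Hypothetical helper: decompose each character (NFKD),
    # then drop the combining marks that carried the accents.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


tag = "#tag"
text = "test string with #tág inserted and a #TAG"

# re.escape guards against tags that contain regex metacharacters.
matches = re.findall(re.escape(tag), strip_accents(text), re.IGNORECASE)
print(matches)
```

Note that the match positions refer to the normalized text, not the original one, which matters if you need to slice the original string afterwards.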

Related

how to remove quoted words/phrases from a string using python?

I have a string that contains words or phrases enclosed in double quotes, and I need to remove them from between the quotes in Python. Example:
The text has "single quotes" and "commas".
The text has "double quotes".
removing the words from the quotes results in this:
The text has " " and " ".
The text has " ".
I used re.finditer, which lists all the quotes found, but I don't know how to remove the words between the quotes in the string. Does anybody know?

>>> from re import sub
>>> s
'The text has "single quotes" and "commas".'
>>> sub('".*?"', '" "', s)
'The text has " " and " ".'

A bit complicated, but maybe,
(?<=")[^\s".][^"\r\n]*|[^"\r\n]*[^\s".](?=")
might be OK to look into.
This pattern would probably fail on some edge cases, which you'd likely want to look into:
[^\s".]
Test
import re
string = '''
The text has "single quotes" and "commas".
The text has "double quotes"
"single quotes" and "commas"
"double quotes"
"d"
"d""d""d""d"
'''
expression = r'(?<=")[^\s".][^"\r\n]*|[^"\r\n]*[^\s".](?=")'
print(re.sub(expression, '', string))
Output
The text has "" and "".
The text has ""
"" and ""
""
""
""""""""
If you wish to simplify, modify, or explore the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.

Take a look at this simple regex:
"[\w\s]+"
We capture any word characters and possible spaces between " ", and then replace with "":
expression = r'"[\w\s]+"'
print(re.sub(expression, '""', string))
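One thing to keep in mind with "[\w\s]+" is that it only accepts word characters and whitespace between the quotes, so a quoted phrase containing punctuation is skipped. The short comparison below is a sketch using a made-up sample sentence; the negated class "[^"]*" is shown as a more permissive alternative.

```python
import re

# Made-up sample: the quoted phrase contains a hyphen.
s = 'The text has "semi-detached" houses.'

# Word characters and whitespace only: the hyphen prevents a match.
strict = re.sub(r'"[\w\s]+"', '""', s)

# Anything except a double quote: the phrase is emptied.
permissive = re.sub(r'"[^"]*"', '""', s)

print(strict)
print(permissive)
```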

you can use this code. Hope it helps.
import re
text = 'The text has "single quotes" and "commas".'
text = re.sub('"[^"]*"', '""', text)
print(text) # The text has "" and "".

RegEx for capturing part of a string

I am trying to grab top level Markdown headings (i.e., headings beginning with a single hash -- # Introduction) in an .md doc with Python's re library and cannot for the life of me figure this out.
Here is the code I'm trying to execute:
import re
pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text)
print(header.string)
The result from the print(header.string) is:
# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n whereas I only want # Title\n
This example on regex101 says it should work, but I can't figure out why it isn't. https://regex101.com/r/u4ZIE0/9

You get that result because header.string calls .string on a Match object, which returns the whole string passed to match() or search(), not the matched part.
Because text is a raw string, each \n in it is a literal backslash followed by n rather than a newline:
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
Your pattern matches those same two characters with \\n, so it works as written (note that the match includes the trailing \n). To print only the matched text, use header.group():
import re
pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text)
print(header.group())
Note that re.search looks for the first location where the regex produces a match.
If the text contains real newlines, another option is to match, from the start of a line, a # followed by a space and then any character except a newline until the end of that line (with re.M, ^ and $ match at every line boundary):
^# .*$
For example:
import re
pattern = r"^# .*$"
text = "# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text, re.M)
print(header.group())
If there can not be any more # following after, you might also use a negated character class to match not a # or a newline:
^# [^#\n\r]+$
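As a quick illustration of that last pattern, the sketch below collects every top-level heading with re.findall. The sample text is made up and contains real newlines (it is not a raw string).

```python
import re

# Made-up Markdown sample with real newlines.
text = "# Title\n## Chapter\n### sub-chapter\n# Appendix\n"

# re.M makes ^ and $ match at the start and end of each line.
top_level = re.findall(r"^# [^#\n\r]+$", text, flags=re.M)
print(top_level)  # ['# Title', '# Appendix']
```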

I'm guessing that we are wishing to extract the # Title\n, which in that case, your expression seems to be working fine with a slight modification:
(# .+?\\n)(.+)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(# .+?\\n)(.+)"
test_str = "# Title\\n## Chapter\\n### sub-chapter#### The Bar\\nIt was a fall day.\\n"
subst = "\\1"
# You can limit the number of replacements with the count argument
result = re.sub(regex, subst, test_str, count=1)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

get full string before and after a specific pattern

I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every chunk in this sentence that contains &#, from the space before it to the space after it.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.

You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff

You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, the first \s+ matches one or more whitespace characters, each \S* matches zero or more non-whitespace characters on either side of &#, and the final \s+ matches the trailing whitespace. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
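If a regex is not required, the same cleanup can be sketched by splitting on whitespace and dropping any token that contains &#. This is a different technique from the pattern above, not a variant of it; it also handles noise at the very start or end of the string, at the cost of collapsing every run of whitespace to a single space. The sample string is made up.

```python
# Made-up sample with noise at the start, in the middle, and at the end.
text = "lead&#noise this is some text lskdfmd&#kjansdl and more sldkf&#lsakjd"

# Keep only the whitespace-separated tokens that do not contain "&#".
cleaned = " ".join(word for word in text.split() if "&#" not in word)
print(cleaned)  # this is some text and more
```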

Try this:
import re
result = re.findall(r"[a-zA-Z]+\&\#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the items in the result list from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+\&\#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (each removal leaves a doubled space behind)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+\&\#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'

Remove numbers from string square bracket

I have a huge string which contains a lot of numbers in square brackets. For instance:
[1] this is an example
...
[123] another example
How can I remove the numbers and the brackets from my text string?
My current code to extract the text from a file:
text = txtFile.read()
text = str(text)
text = text.replace("\\n", " ")
text = " ".join(text.split())

Try using re.sub:
import re
text = txtFile.read()
text = str(text)
text = re.sub(r'\[\d+\]', '', text)
The regex pattern \[\d+\] matches any bracketed term that contains one or more digits.
Note that re.sub replaces every match in the input string by default, not just the first one.
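Put together with the whitespace cleanup from the question, the whole step might look like the sketch below. The sample string stands in for the contents of the file, which are not shown in the question.

```python
import re

# Stand-in for txtFile.read(); the real file contents are not shown.
text = "[1] this is an example\n[123] another example"

text = re.sub(r"\[\d+\]", "", text)  # drop every bracketed number
text = " ".join(text.split())         # collapse newlines and repeated spaces
print(text)  # this is an example another example
```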

Replace all text between 2 strings python

Let's say I have:
a = r''' Example
This is a very annoying string
that takes up multiple lines
and h#s a// kind{s} of stupid symbols in it
ok String'''
I need a way to replace (or just delete) any text between "This" and "ok" so that when I call it, a now equals:
a = "Example String"
I can't find any wildcards that seem to work. Any help is much appreciated.

You need a regular expression:
>>> import re
>>> re.sub('\nThis.*?ok','',a, flags=re.DOTALL)
' Example String'

Another method is to use string splits:
def replaceTextBetween(originalText, delimiterA, delimiterB, replacementText):
    leadingText = originalText.split(delimiterA)[0]
    trailingText = originalText.split(delimiterB)[1]
    return leadingText + delimiterA + replacementText + delimiterB + trailingText
Limitations:
Does not check if the delimiters exist
Assumes that there are no duplicate delimiters
Assumes that delimiters are in correct order
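Some of these limitations can be softened with str.partition, which reports whether the delimiter was found. The sketch below is one possible variant, not the answer's own code: the name replace_between is made up, it uses the first start delimiter and the first end delimiter after it, and it returns the text unchanged when either is missing.

```python
def replace_between(text, start, end, replacement):
    # Hypothetical variant of the split-based approach.
    head, found_start, rest = text.partition(start)
    if not found_start:
        return text  # start delimiter missing: leave the text alone
    _, found_end, tail = rest.partition(end)
    if not found_end:
        return text  # no end delimiter after the start: leave the text alone
    return head + start + replacement + end + tail


print(replace_between("a [x] b", "[", "]", "y"))  # a [y] b
```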

The DOTALL flag is the key. Ordinarily, the '.' character doesn't match newlines, so you don't match across lines in a string. If you set the DOTALL flag, re will match '.*' across as many lines as it needs to.
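The effect of the flag is easy to see by running the same substitution with and without it. This is a small sketch on a made-up multi-line string:

```python
import re

# Made-up sample spanning several lines.
a = "Example\nThis is line one\nand line two\nok String"

# Without DOTALL, '.' stops at the newline, so nothing matches.
without_flag = re.sub(r"This.*?ok", "", a)

# With DOTALL, '.' also matches newlines, so the span is removed.
with_flag = re.sub(r"This.*?ok", "", a, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```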

Use re.sub: it replaces everything from one delimiter through the other, delimiters included, with the string you choose.
format: re.sub('A(.*?)B', P, Q, flags=re.DOTALL)
where
A : starting character, symbol, or string
B : ending character, symbol, or string
P : replacement for the whole match (A, B, and the text between them)
Q : input string
re.DOTALL : lets . match newlines, so the match can span lines
import re
re.sub('\nThis(.*?)ok', '', a, flags=re.DOTALL)
output : ' Example String'
Let's see an example with HTML code as input:
input_string = '''<body> <h1>Heading</h1> <p>Paragraph</p><b>bold text</b></body>'''
Target : remove the <p> element
re.sub('<p>(.*?)</p>', '', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> <b>bold text</b></body>'
Target : replace the <p> element with the word test
re.sub('<p>(.*?)</p>', 'test', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> test<b>bold text</b></body>'

a=re.sub('This.*ok','',a,flags=re.DOTALL)

If you want first and last words:
re.sub(r'^\s*(\w+).*?(\w+)$', r'\1 \2', a, flags=re.DOTALL)
