Python 2.7 regular expression match with \r\n in string [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 7 months ago.
Is there a way to match a string that contains \r\n? The same regular expression no longer matches once the input string contains \r\n. I am using Python 2.7.
This works fine:
import re
content = '{(1) hello (1)}'
reg = '{\(1\)(.*?)\(1\)}'
results = re.findall(reg, content)
print results[0]
prog = re.compile(reg)
results = prog.findall(content)
print results[0]
But it stops working once \r\n is added:
import re
content = '{(1) hello \r\n (1)}'
reg = '{\(1\)(.*?)\(1\)}'
results = re.findall(reg, content)
print results[0]
prog = re.compile(reg)
results = prog.findall(content)
print results[0]

This works:
>>> import re
>>>
>>> content = '{(1) hello \r\n (1)}'
>>> reg = '{\(1\)(.*?)\(1\)}'
>>> results = re.findall(reg, content, re.DOTALL)
>>>
>>> print results[0]
hello
>>>
>>> prog = re.compile(reg, re.DOTALL)
>>> results = prog.findall(content)
>>>
>>> print results[0]
hello
>>>
From Python Docs:
'.' (Dot.) In the default mode, this matches any character except a
newline. If the DOTALL flag has been specified, this matches any
character including a newline.
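As a rough sketch of the difference, the snippet below runs the same pattern in the default mode, with re.DOTALL, with the inline (?s) flag, and with a [\s\S] character class that does not depend on any flag. The variable names are only illustrative, and print() is called with a single argument so the snippet runs on Python 2.7 as well as Python 3.

```python
import re

content = '{(1) hello \r\n (1)}'
reg = r'\{\(1\)(.*?)\(1\)\}'

# Default mode: '.' stops at '\n', so the group can never reach the closing "(1)}".
default_results = re.findall(reg, content)

# re.DOTALL lets '.' match the newline as well.
dotall_results = re.findall(reg, content, re.DOTALL)

# The inline flag (?s) at the start of the pattern is equivalent to re.DOTALL.
inline_results = re.findall('(?s)' + reg, content)

# [\s\S] matches any character at all, regardless of flags.
class_results = re.findall(r'\{\(1\)([\s\S]*?)\(1\)\}', content)

print(repr(default_results))
print(repr(dotall_results))
```

Note that \r alone is not the problem: in the default mode '.' matches \r, and only \n stops it.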

Related

How to replace/delete a string in python

How can I replace or delete part of a string, like this?
string = '{DDBF1F} this is my string {DEBC1F}'
# {DDBF1F}: the code between the braces is random; I only know it is made up of 6 characters
The output should be:
this is my string
I tried this. I know it doesn't work, but I tried :3
string = '{DDBF1F} Hello {DEBC1F}'
string.replace(f'{%s%s%s%s%s%s}', 'abc')
print(string)
Use the re library to perform a regex replace, like this:
import re
text = '{DDBF1F} Hello {DEBC1F}'
result = re.sub(r"(\s?\{[A-F0-9]{6}\}\s?)", "", text)
print(result)
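Two things are worth spelling out here. Strings are immutable, so the result of a replacement has to be assigned to a name; calling a method on the string does not change it in place. Also, stripping the tags and trimming the whitespace can be done as two separate steps. The following is a minimal sketch under the assumption that every tag is exactly six uppercase hex digits in braces; TAG and strip_tags are hypothetical names.

```python
import re

# Assumption: a tag is exactly six characters from 0-9 / A-F inside braces.
TAG = re.compile(r'\{[A-F0-9]{6}\}')

def strip_tags(text):
    """Remove every {XXXXXX} tag, then trim the leftover outer whitespace."""
    return TAG.sub('', text).strip()

result = strip_tags('{DDBF1F} this is my string {DEBC1F}')
print(result)
```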
If the length of the strings within the brackets is fixed, you can use slicing to get the inner substring:
>>> string = '{DDBF1F} this is my string {DEBC1F}'
>>> string[8:-8]
' this is my string '
(string[9:-9] if you want to remove the surrounding spaces)
If hardcoding the indexes feels bad, they can be derived using str.index (if you can be certain that the string will not contain an embedded '}'):
>>> start = string.index('}')
>>> start
7
>>> end = string.index('{', start)
>>> end
27
>>> string[start+1:end]
' this is my string '
This code works:
string = '{DDBF1F} this is my string {DEBC1F}'
st = string.split(' ')
new_str = ''
for i in st:
    if i.startswith('{') and i.endswith('}'):
        pass
    else:
        new_str = new_str + " " + i
print(new_str)

Python regex if all whole words in string [duplicate]

This question already has answers here:
Do regular expressions from the re module support word boundaries (\b)?
(5 answers)
Closed 4 years ago.
I have the following string. I need to check whether
the string contains App2 and iPhone,
but not App and iPhone.
I wrote the following:
campaign_keywords = "App2 iPhone"
my_string = "[Love]App2 iPhone Argentina"
pattern = re.compile("r'\b" + campaign_keywords + "\b")
print pattern.search(my_string)
It prints None. Why?
The raw string notation is wrong: the r should not be inside the quotes, and the second \b should also be in a raw string.
The match function only tries to match at the start of the string. You need to use search or findall.
See: Difference between re.search and re.match
Example
>>> pattern = re.compile(r"\b" + campaign_keywords + r"\b")
>>> pattern.findall(my_string)
['App2 iPhone']
>>> pattern.match(my_string)
>>> pattern.search(my_string)
<_sre.SRE_Match object at 0x10ca2fbf8>
>>> match = pattern.search(my_string)
>>> match.group()
'App2 iPhone'
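The reason the raw string matters is that in an ordinary string literal "\b" is a backspace character (\x08), while r"\b" is the two characters the regex engine reads as a word boundary. Here is a small sketch of the check as a function; contains_phrase is a hypothetical helper, and the re.escape call is an extra precaution not present in the original code.

```python
import re

def contains_phrase(phrase, text):
    """Return True if `phrase` occurs in `text` delimited by word boundaries."""
    # re.escape guards against regex metacharacters inside the phrase.
    pattern = re.compile(r'\b' + re.escape(phrase) + r'\b')
    return pattern.search(text) is not None

my_string = '[Love]App2 iPhone Argentina'
has_app2 = contains_phrase('App2 iPhone', my_string)
has_app = contains_phrase('App iPhone', my_string)

print(has_app2)
print(has_app)
```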

Python \ufffd after replacement with Chinese content

After we found the answer to this question, we ran into another unusual replacement behavior.
Our regex is:
[\\(（\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[）\\)\\]}】]+
We are trying to match all content inside any type of brackets, including the brackets themselves.
The original text is:
物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))
The result is:
物�研真题详解
The code for the replacement is:
delimiter = ' '
if localization == 'CN':
    delimiter = ''
p = re.compile(codecs.encode(unicode(regex), "utf-8"), flags=re.I)
columnString = p.sub(delimiter, columnString).strip()
Why does the � (\ufffd) character appear, and how can we fix this behavior?
We face the same problem when we use the regex:
(\\d*[满|元])
print repr(columnString)='\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
print repr(regex)=u'[\\(\uff08\\[{\u3010]+(\\w+|\\s+|\\S+|\\W+)?[\uff09\\)\\]}\u3011]+'
print repr(p.pattern)='[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
You should not mix UTF-8 byte strings and regular expressions. Process all your text as Unicode. Make sure you decode both the regex and the input string to unicode values first:
>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'
>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'
>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解
When you use UTF-8, your pattern consists of separate bytes for the 【 codepoint:
>>> '【'
'\xe3\x80\x90'
which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.
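To see the effect in isolation, here is a small sketch (the names are illustrative) that puts only the fullwidth left parenthesis U+FF08 into a character class. Its UTF-8 encoding is \xef\xbc\x88, and the character 伤 (U+4F24) from the sample input encodes as \xe4\xbc\xa4, so the two share the byte \xbc. The u'' and b'' literals are used so that the snippet behaves the same on Python 2.7 and Python 3.

```python
import re

paren = u'\uff08'   # fullwidth left parenthesis, UTF-8: \xef\xbc\x88
char = u'\u4f24'    # CJK character, UTF-8: \xe4\xbc\xa4 (shares the byte \xbc)

# Byte-oriented pattern: the class holds three separate bytes, not one character.
byte_class = re.compile(b'[' + paren.encode('utf-8') + b']')
damaged = byte_class.sub(b'', char.encode('utf-8'))

# Text-oriented pattern: the class holds exactly one code point.
text_class = re.compile(u'[' + paren + u']')
intact = text_class.sub(u'', char)

print(repr(damaged))                             # the middle byte is gone
print(repr(damaged.decode('utf-8', 'replace')))  # contains U+FFFD
print(repr(intact))                              # unchanged
```

The byte pattern removes the middle byte of an unrelated character. What is left is no longer valid UTF-8, and that is where the replacement character comes from when the result is decoded or displayed.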
Decode your string first, and you can get rid of that � (\ufffd) character.
In [1]: import re
...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')
...: reobj = re.compile(r"[\((\[{【]+(\w+|\s+|\S+|\W+)?[)\)\]}】]+", re.IGNORECASE | re.MULTILINE)
...: result = reobj.sub("", subject)
...: print result
...:
物理化学名校考研真题详解

regex to match a word and everything after it?

I need to dump some HTTP data as a string from an HTTP packet, which I have in string format. I am trying to use the regular expression below to match 'data:' and everything after it, but it is not working. I am new to regex and Python.
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.match(pat, string)
>>> print res
None
re.match matches only at the beginning of the string. Use re.search to match at any position. (See search() vs. match())
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> res
<_sre.SRE_Match object at 0x0000000002838100>
>>> res.group()
'p'
To match everything, you need to replace \w with .*. Also remove the /b.
>>> import re
>>> pat = re.compile(r'(?:data:).*$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> print res.group()
data: ndknfdjoj pop
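If only the part after the marker is needed, a capture group can separate it from the marker itself. The sketch below makes two assumptions that go beyond the question: a word boundary is written as \b (backslash, not /b), and re.DOTALL is passed so that a payload spanning several lines would still be captured. The name packet is illustrative.

```python
import re

packet = ' dnfhndkn data: ndknfdjoj pop'

# \b is a word boundary; group 1 holds everything after "data:" and any spaces.
pattern = re.compile(r'\bdata:\s*(.*)$', re.DOTALL)

anchored = pattern.match(packet)   # None: match() only tries position 0
found = pattern.search(packet)     # search() scans forward through the string

print(anchored)
print(found.group(0))
print(found.group(1))
```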
No need for a regular expression here. You can just slice the string:
>>> string
' dnfhndkn data: ndknfdjoj pop'
>>> string.index('data')
10
>>> string[string.index('data'):]
'data: ndknfdjoj pop'
str.index('data') returns the point in the string where the substring data is found. The slice from this position to the end string[10:] gives you the part of the string you are interested in.
By the way, string is a potentially problematic variable name if you are planning on using the string module at any point...
You can just do:
string.split("data:")[1]
assuming "data:" appears only once in each string.
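One caveat with both non-regex approaches: str.index raises ValueError and split(...)[1] raises IndexError when the marker is missing. If that case can occur, str.partition is one way to handle it without exceptions. A minimal sketch, where after_marker is a hypothetical helper name:

```python
def after_marker(text, marker='data:'):
    """Return the text following the first `marker`, or None if it is absent."""
    head, sep, tail = text.partition(marker)
    # partition() returns an empty separator when the marker was not found.
    return tail if sep else None

present = after_marker(' dnfhndkn data: ndknfdjoj pop')
missing = after_marker('no marker here')

print(repr(present))
print(repr(missing))
```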

Search for quotes with regular expression

I'm looking for a way to search a text file for quotes made by an author and then print them out. My script so far:
import re
#searches end of string
print re.search('"$', 'i am searching for quotes"')
#searches start of string
print re.search('^"' , '"i am searching for quotes"')
What I would like to do:
import re
## load text file
quotelist = open('A.txt','r').read()
## search for strings contained with quotation marks
re.search ("-", quotelist)
## Store in list or Dict
Dict = quotelist
## Print quotes
print Dict
I also tried:
import re
buffer = open('bbc.txt','r').read()
quotes = re.findall(r'.*"[^"].*".*', buffer)
for quote in quotes:
    print quote
# Add quotes to list
l = []
for quote in quotes:
    print quote
    l.append(quote)
Develop a regular expression that matches all the characters you would expect to see inside a quoted string. Then use the findall function in re to find all occurrences of the match.
import re
buffer = open('file.txt','r').read()
quotes = re.findall(r'"[^"]*"',buffer)
for quote in quotes:
    print quote
Searching between " and ” requires a unicode-regex search such as:
quotes = re.findall(ur'"[^\u201d]*\u201d',buffer)
And for a document that uses " and ” interchangeably for quotation termination:
quotes = re.findall(ur'"[^"\u201d]*["\u201d]', buffer)
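For comparison, here is a self-contained sketch that collects both straight-quoted and curly-quoted spans into a list. It assumes the text has already been decoded to unicode and that curly quotes come in matching “ ... ” pairs; the sample sentence is made up, and a real script would read the text from a file instead.

```python
import re

# Hypothetical sample text standing in for the decoded file contents.
text = u'He said "hello" and she replied \u201cgoodbye\u201d.'

# Either a straight-quoted span or a curly-quoted span.
pattern = re.compile(u'"[^"]*"|\u201c[^\u201d]*\u201d')

# With no capture groups, findall() returns the full matches as a list.
quotes = pattern.findall(text)
for quote in quotes:
    print(repr(quote))
```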
You don't need regular expressions to find static strings. You should use this Python idiom for finding strings:
>>> haystack = 'this is the string to search!'
>>> needle = '!'
>>> if needle in haystack:
...     print 'Found', needle
Creating a list is easy enough -
>>> matches = []
Storing matches is easy too...
>>> matches.append('add this string to matches')
This should be enough to get you started. Good luck!
An addendum to address the comment below...
l = []
for quote in matches:
    print quote
    l.append(quote)
