Python \ufffd after replacement with Chinese content - python

After finding the answer to this question, we ran into another unusual replacement behavior:
Our regex is:
[\\((\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[)\\)\\]}】]+
We are trying to match all content inside any type of brackets, including the brackets themselves.
The original text is:
物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))
The result is:
物�研真题详解
The code for the replacement is:
delimiter = ' '
if localization == 'CN':
    delimiter = ''
p = re.compile(codecs.encode(unicode(regex), "utf-8"), flags=re.I)
columnString = p.sub(delimiter, columnString).strip()
Why does the � (\ufffd) character appear, and how can we fix this behavior?
We hit the same problem with this regex:
(\\d*[满|元])
print repr(columnString)='\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
print repr(regex)=u'[\\(\uff08\\[{\u3010]+(\\w+|\\s+|\\S+|\\W+)?[\uff09\\)\\]}\u3011]+'
print repr(p.pattern)='[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'

You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decode both the regex and the input string to unicode values first:
>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'
>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'
>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解
When you use UTF-8, your pattern consists of separate bytes for the 【 codepoint:
>>> '【'
'\xe3\x80\x90'
which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.
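The byte-level effect described above can be reproduced in Python 3 as well, where bytes and text are separate types. The following is only a sketch under that assumption (the original session is Python 2); the names REGEX, TEXT, bytes_result and text_result are illustrative and not part of the question's code.

```python
import re

# The bracket regex from the question, written as text (a Python 3 str).
REGEX = r"[\(（\[{【]+(\w+|\s+|\S+|\W+)?[）\)\]}】]+"
TEXT = "当代骨伤科妙方(第四版)"

# Byte-level matching: the UTF-8 bytes of （ and 【 become separate
# members of the character class, so a match can start in the middle
# of a multi-byte character.
bytes_result = re.sub(REGEX.encode("utf-8"), b"", TEXT.encode("utf-8"), flags=re.I)

# Text-level matching: each bracket is a single code point in the class.
text_result = re.sub(REGEX, "", TEXT, flags=re.I)

print(repr(bytes_result))                       # ends with a dangling lead byte
print(bytes_result.decode("utf-8", "replace"))  # the dangling byte shows up as \ufffd
print(text_result)
```

The byte pattern starts matching at the \xbc byte inside 伤, which leaves a lone \xe4 behind; decoding that with 'replace' is what produces the � character.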

Decode your string first, and you can get rid of that � (\ufffd) character.
In [1]: import re
...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')
...: reobj = re.compile(r"[\((\[{【]+(\w+|\s+|\S+|\W+)?[)\)\]}】]+", re.IGNORECASE | re.MULTILINE)
...: result = reobj.sub("", subject)
...: print result
...:
物理化学名校考研真题详解

Related

How to replace/delete a string in python

How can I replace or delete part of a string, like this:
string = '{DDBF1F} this is my string {DEBC1F}'
#{DDBF1F} the code between the braces is random; I only know it is made up of 6 characters
the output should be
this is my string
I tried this, I know it doesn't work, but I tried :3
string = '{DDBF1F} Hello {DEBC1F}'
string.replace(f'{%s%s%s%s%s%s}', 'abc')
print(string)
Use the re library to perform a regex replace, like this:
import re
text = '{DDBF1F} Hello {DEBC1F}'
result = re.sub(r"(\s?\{[A-F0-9]{6}\}\s?)", "", text)
print(result)
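One thing to watch with the optional \s? on both sides is that a code in the middle of a sentence takes both neighbouring spaces with it. A possible variation, shown here only as a sketch, replaces each code with a space and then collapses whitespace; strip_codes and CODE are hypothetical names, and the pattern assumes the codes are always six uppercase hex characters as in the question.

```python
import re

# Assumption: a code is exactly six characters from A-F / 0-9 inside braces.
CODE = re.compile(r"\{[A-F0-9]{6}\}")

def strip_codes(text):
    """Remove {XXXXXX} codes and normalise the whitespace left behind."""
    without_codes = CODE.sub(" ", text)
    # split() with no arguments drops leading, trailing and repeated spaces.
    return " ".join(without_codes.split())

print(strip_codes("{DDBF1F} this is my string {DEBC1F}"))
print(strip_codes("left {DDBF1F} right"))
```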
If the length of the strings within the brackets is fixed, you can use slicing to get the inner substring:
>>> string = '{DDBF1F} this is my string {DEBC1F}'
>>> string[8:-8]
' this is my string '
(string[9:-9] if you want to remove the surrounding spaces)
If hardcoding the indexes feels bad, they can be derived using str.index (if you can be certain that the string will not contain an embedded '}'):
>>> start = string.index('}')
>>> start
7
>>> end = string.index('{', start)
>>> end
27
>>> string[start+1:end]
' this is my string '
This code works
string = '{DDBF1F} this is my string {DEBC1F}'
st = string.split(' ')
new_str = ''
for i in st:
    if i.startswith('{') and i.endswith('}'):
        pass
    else:
        new_str = new_str + " " + i
print(new_str)
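The same split-and-filter idea can be written more compactly with a list comprehension and str.join, which also avoids the leading space that the loop above leaves in the result. This is a sketch of that idea rather than a replacement for the answer; drop_brace_tokens is a hypothetical name.

```python
def drop_brace_tokens(text):
    """Keep only the space-separated tokens that are not wrapped in braces."""
    kept = [
        token
        for token in text.split(' ')
        if not (token.startswith('{') and token.endswith('}'))
    ]
    # join() puts a single space between tokens and none at either end.
    return ' '.join(kept)

print(drop_brace_tokens('{DDBF1F} this is my string {DEBC1F}'))
```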

Python how to remove escape characters from a string

I have a string like below, and I want to remove all \x06 characters from the string in Python.
Ex:
s = 'test\x06\x06\x06\x06'
s1 = 'test2\x04\x04\x04\x04'
print(literal_eval("'%s'" % s))
output:
test♠♠♠♠
I just need String test and remove all \xXX.
Maybe the regex module is the way to go
>>> s = 'test\x06\x06\x06\x06'
>>> s1 = 'test2\x04\x04\x04\x04'
>>> import re
>>> re.sub('[^A-Za-z0-9]+', '', s)
'test'
>>> re.sub('[^A-Za-z0-9]+', '', s1)
'test2'
If you want to remove all \xXX characters (non-printable ascii characters) the best way is probably like so
import string
def remove_non_printable(s):
    return ''.join(c for c in s if c not in string.printable)
Note this won't work with any non-ascii printable characters (like é, which will be removed).
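If non-ASCII letters such as é need to survive, one option in Python 3 is str.isprintable(), which follows the Unicode definition of printable instead of the ASCII-only string.printable. The sketch below assumes Python 3; keep_printable is a hypothetical name. Unlike string.printable, isprintable() treats tabs and newlines as non-printable, so they are removed too.

```python
def keep_printable(s):
    """Drop characters that Unicode classifies as non-printable.

    Note: '\t', '\n' and '\r' are also dropped, because
    str.isprintable() returns False for them.
    """
    return ''.join(c for c in s if c.isprintable())

print(keep_printable('test\x06\x06\x06\x06'))
print(keep_printable('caf\u00e9\x04'))
```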
This should do it
import re #Import regular expressions
s = 'test\x06\x06\x06\x06' #Input s
s1 = 'test2\x04\x04\x04\x04' #Input s1
print(re.sub('\x06','',s)) #remove all \x06 from s
print(re.sub('\x04','',s1)) #remove all \x04 from s1
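When the exact control character is not known in advance, a character class covering the ASCII control range can remove them all in one pass. This is a sketch, not part of the answer above; CONTROL_CHARS and strip_control_chars are hypothetical names, and keeping tab, newline and carriage return is an assumption about what is wanted.

```python
import re

# ASCII control characters, except \t (\x09), \n (\x0a) and \r (\x0d),
# which are deliberately kept.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def strip_control_chars(s):
    """Remove ASCII control characters such as \\x04 and \\x06."""
    return CONTROL_CHARS.sub('', s)

print(strip_control_chars('test\x06\x06\x06\x06'))
print(strip_control_chars('test2\x04\x04\x04\x04'))
```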

python 2.7 regular expression match with \r\n in string [duplicate]

Is there a way to match a string that contains \r\n? The same regular expression does not seem to match when the input string contains \r\n. I am using Python 2.7.
This works fine:
import re
content = '{(1) hello (1)}'
reg = '{\(1\)(.*?)\(1\)}'
results = re.findall(reg, content)
print results[0]
prog = re.compile(reg)
results = prog.findall(content)
print results[0]
But it does not work once \r\n is added:
import re
content = '{(1) hello \r\n (1)}'
reg = '{\(1\)(.*?)\(1\)}'
results = re.findall(reg, content)
print results[0]
prog = re.compile(reg)
results = prog.findall(content)
print results[0]
regards,
Lin
This works:
>>> import re
>>>
>>> content = '{(1) hello \r\n (1)}'
>>> reg = '{\(1\)(.*?)\(1\)}'
>>> results = re.findall(reg, content, re.DOTALL)
>>>
>>> print results[0]
hello
>>>
>>> prog = re.compile(reg, re.DOTALL)
>>> results = prog.findall(content)
>>>
>>> print results[0]
hello
>>>
From Python Docs:
'.' (Dot.) In the default mode, this matches any character except a
newline. If the DOTALL flag has been specified, this matches any
character including a newline.
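The difference the documentation describes can be checked side by side. The sketch below uses Python 3 syntax (an assumption; the question uses Python 2.7) and also shows the inline (?s) form, which is equivalent to passing re.DOTALL. The variable names are illustrative.

```python
import re

content = '{(1) hello \r\n (1)}'
reg = r'\{\(1\)(.*?)\(1\)\}'

# Default mode: '.' stops at '\n', so the closing '(1)}' is never reached.
without_flag = re.findall(reg, content)

# DOTALL: '.' also matches '\n', so the group spans the line break.
with_flag = re.findall(reg, content, re.DOTALL)

# The same flag written inline at the start of the pattern.
with_inline_flag = re.findall('(?s)' + reg, content)

print(without_flag)
print(with_flag)
print(with_inline_flag)
```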

regex to match a word and everything after it?

I need to dump some HTTP data as a string from an HTTP packet that I have in string format. I am trying to use the regular expression below to match 'data:' and everything after it, but it is not working. I am new to regex and Python.
>>> import re
>>> pat=re.compile(r'(?:/bdata:/b)?\w$')
>>> string=" dnfhndkn data: ndknfdjoj pop"
>>> res=re.match(pat,string)
>>> print res
None
re.match matches only at the beginning of the string. Use re.search to match at any position. (See search() vs. match())
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> res
<_sre.SRE_Match object at 0x0000000002838100>
>>> res.group()
'p'
To match everything, you need to replace \w with .*. Also remove the /b.
>>> import re
>>> pat = re.compile(r'(?:data:).*$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> print res.group()
data: ndknfdjoj pop
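The /b in the original pattern was presumably meant to be the word-boundary escape \b. If a boundary before data: is actually wanted, it could look like the sketch below (Python 3 syntax; the "metadata:" input is a made-up example to show what the boundary excludes). It also shows re.match and re.search on the same pattern.

```python
import re

# \b requires a word boundary before 'data:', so 'metadata:' is not matched.
pat = re.compile(r'\bdata:.*$')
text = " dnfhndkn data: ndknfdjoj pop"

at_start = pat.match(text)    # match() only tries position 0
anywhere = pat.search(text)   # search() scans the whole string
no_boundary = pat.search("metadata: something")

print(at_start)
print(anywhere.group())
print(no_boundary)
```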
No need for a regular expression here. You can just slice the string:
>>> string
' dnfhndkn data: ndknfdjoj pop'
>>> string.index('data')
10
>>> string[string.index('data'):]
'data: ndknfdjoj pop'
str.index('data') returns the point in the string where the substring data is found. The slice from this position to the end string[10:] gives you the part of the string you are interested in.
By the way, string is a potentially problematic variable name if you are planning on using the string module at any point...
You can just do:
string.split("data:")[1]
assuming "data:" appears only once in each string
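If "data:" might be missing or might appear more than once, str.partition is a possible alternative: it splits at the first occurrence only and reports whether the separator was found, so there is no IndexError to handle. This is a sketch; after_marker is a hypothetical helper name, and returning None for a missing marker is an arbitrary choice.

```python
def after_marker(text, marker="data:"):
    """Return everything after the first occurrence of marker, or None."""
    head, sep, tail = text.partition(marker)
    # sep is the empty string when marker does not occur in text.
    return tail if sep else None

print(after_marker(" dnfhndkn data: ndknfdjoj pop"))
print(after_marker("no marker here"))
```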

Python testing if whitespace(\s) symbol is in string

I have a problem testing whether "\s" symbols are present in a string. For example, '\sgoogle\s.com' should show that they are.
# use raw strings to ignore escapes:
s = r'\sgoogle\s'
print s, s.find(r'\s') != -1
# and with regex:
import re
print re.search(r'\\s', s)
Gives:
\sgoogle\s True
<_sre.SRE_Match object at 0x7f2d696fc850>
You have to escape your \ as follows:
s = '\sgoogle\s.com'
re.search(r'(\\s)', s)
Demo: http://regex101.com/r/tF9oW6
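It may help to see the two meanings of \s next to each other: as a regex escape it matches whitespace, while the question asks about a literal backslash followed by the letter s. The sketch below uses Python 3 syntax and illustrative variable names.

```python
import re

s = r'\sgoogle\s.com'  # raw string: contains a real backslash followed by 's'

# Literal test: is the two-character sequence backslash + 's' present?
has_literal_backslash_s = r'\s' in s

# Regex test with an escaped backslash: also looks for the literal sequence.
regex_literal = re.search(r'\\s', s) is not None

# Regex test with a bare \s: looks for whitespace, and s contains none.
regex_whitespace = re.search(r'\s', s) is not None

print(has_literal_backslash_s, regex_literal, regex_whitespace)
```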
