python regular expression with utf8 issue

I have a file containing many lines of plain UTF-8 text, such as the one below (the text is Chinese).
PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 订单号：W12022910079166 交易金额：0.01元 交易状态：true 2012-2-29 10:13:08
The file itself is saved as UTF-8 and is named xx.txt.
Here is my Python code (the environment is Python 2.7):
#coding: utf-8
import re
pattern = re.compile(r'交易金额：(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()
The problem is that I get no results.
I want to extract the decimal string from 交易金额：0.01元, which here is 0.01.
Why doesn't this code work? Can anyone explain it to me? I have no clue whatsoever.

There are several issues with your code. First, you should use re.compile(ur'<unicode string>') so that the pattern is a unicode string, like the decoded line. It is also nice to add the re.UNICODE flag (I'm not sure it is really needed here, though). Next, you still will not get a match, because \d+ matches only a run of digits and does not handle the decimal point; use \d+(?:\.\d+)? instead (digits, optionally followed by a dot and more digits). Example code:
#coding: utf-8
import re
text = u"PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 订单号：W12022910079166 交易金额：0.01元 交易状态：true 2012-2-29 10:13:08"
pattern = re.compile(ur'交易金额：(\d+(?:\.\d+)?)元', re.UNICODE)
print pattern.search(text).group(1)

You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
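The difference between the two calls can be seen in a small sketch. It is written for Python 3 (an assumption; the question uses Python 2.7), where every str is already Unicode, and the sample line is shortened for illustration. The amount sub-pattern \d+(?:\.\d+)? is one reasonable choice, not the only one.

```python
import re

# Shortened sample line; in Python 3 a str literal is already Unicode.
line = "PROCESS：类型：关爱积分[NOTIFY] 交易金额：0.01元 交易状态：true"

# Digits, optionally followed by a dot and more digits.
pattern = re.compile(r'交易金额：(\d+(?:\.\d+)?)元')

at_start = pattern.match(line)    # only tries position 0 of the line
anywhere = pattern.search(line)   # scans the whole line

print(at_start)            # no match object, because the line starts with PROCESS
print(anywhere.group(1))   # the captured amount
```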

If you use UTF-8, you can use flags=re.LOCALE:
#coding: utf-8
import re
pattern = re.compile(r'交易金额：(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    # search() rather than match(): the amount is not at the start of the line
    match = pattern.search(line)
    if match:
        print match.group(1)
For more details, see re.LOCALE. There is no need to convert UTF-8 to unicode.
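The "no decoding" idea can also be sketched in Python 3 by matching a bytes pattern against the raw bytes of each line. This is an illustration under assumptions: the names AMOUNT and amounts are hypothetical, the sample line is made up from the question's data, and no regex flag is used at all, since the literal UTF-8 bytes already match themselves.

```python
import re

# Bytes pattern: the UTF-8 bytes of the label, an ASCII amount, the UTF-8 bytes of 元.
AMOUNT = re.compile(
    re.escape('交易金额：'.encode('utf-8'))
    + rb'(\d+(?:\.\d+)?)'
    + re.escape('元'.encode('utf-8'))
)

def amounts(raw_lines):
    """Return the amounts found in an iterable of UTF-8 encoded byte lines."""
    found = []
    for raw in raw_lines:
        m = AMOUNT.search(raw)
        if m:
            found.append(m.group(1).decode('ascii'))
    return found

# Stand-in for iterating over open('xx.txt', 'rb').
sample = ['订单号：W12022910079166 交易金额：0.01元 交易状态：true\n'.encode('utf-8')]
print(amounts(sample))
```

Decoding the lines and using a str pattern is usually easier to read; the bytes variant mainly shows that pattern and text must be of the same type.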

Related

ElementTree.ParseError: reference to invalid character number

I get
ElementTree.ParseError: reference to invalid character number
when parsing XML that contains the following as a tag value: loca&#1;t
My code looks like:
respXML = httpResponse.content
#also possible respXML = httpResponse.content.decode("utf-8")
#but both get the same error
#this line throws the error
respRoot = ET.fromstring(respXML)
How can I bulletproof my parser against seemingly invalid character numbers?

That looks like HTML. Try running html.unescape on the input string before anything else (from Python 3.4 on, html.unescape is part of the standard library's html module).
https://pypi.python.org/pypi/html
>>> import html
>>> test = "loca&#1;t"
>>> html.unescape(test)
'locat'
Then convert some known Unicode characters to their ASCII equivalents, e.g.
“ => "
’ => '
...
Finally, replace double spaces with a single space.
Since it will be cumbersome to handle everything correctly up front, I recommend catching the specific exceptions and writing each bad line to a file.
Then address the errors in that output file one by one by adding more rules.
Good luck.

I sometimes find it useful to preserve the original input characters with a regex substitution such as re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s). For example, with
from xml.etree import ElementTree as ET
import html
import re
s = "<Tag>loca&#1;t</Tag>"
using html.unescape produces
ET.fromstring(html.unescape(s)).text
#Out: 'locat'
but the regex pattern mentioned produces
ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
#Out: 'loca[#1;]t'
which preserves the "bad characters".
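A more selective variant is to drop only the numeric character references that XML 1.0 does not allow and leave the valid ones for the parser. The following is a minimal Python 3 sketch, not a complete sanitizer; the helper names drop_invalid_charrefs and _is_valid_xml_char are hypothetical, and the allowed ranges follow the Char production of the XML 1.0 specification.

```python
import re
from xml.etree import ElementTree as ET

# Decimal (&#1;) or hexadecimal (&#x1;) numeric character references.
_CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def _is_valid_xml_char(cp):
    """True if code point cp is allowed in an XML 1.0 document."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def drop_invalid_charrefs(xml_text):
    """Remove character references that would make the XML parser fail."""
    def repl(m):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        return m.group(0) if _is_valid_xml_char(cp) else ''
    return _CHARREF.sub(repl, xml_text)

cleaned = drop_invalid_charrefs("<Tag>loca&#1;t &#233;</Tag>")
root = ET.fromstring(cleaned)
print(root.text)
```

Here &#1; is removed before parsing, while &#233; is kept and resolved by the parser as usual.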

python-re.sub() and unicode

I want to replace all emoji with '', but my regex doesn't work. For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
I want to replace every character of the form \U0001f633 with '', so I wrote:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.

You won't be able to recognize properly decoded Unicode code points that way (as strings containing \uXXXX, etc.). By the time the regex engine sees them, each one is a single character, not a backslash escape sequence.
Depending on whether your Python was compiled with only 16-bit Unicode code points or not, you'll want a pattern like one of these:
# 16-bit codepoints (characters above U+FFFF appear as surrogate pairs)
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions reduce the number of characters in the stripped string to 26.
These expressions strip out the emoji you were after, but they may also strip out other things you do want. It may be worth reviewing a Unicode code point range listing and adjusting the ranges.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
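On Python 3.3 and later the distinction disappears, because sys.maxunicode is always 0x10FFFF and only the second pattern is needed. A small Python 3 sketch, using a shortened version of the sample string (the names ASTRAL and strip_astral are hypothetical):

```python
import re
import sys

# Every code point above the Basic Multilingual Plane, which includes most emoji.
ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

def strip_astral(text):
    """Remove all characters above U+FFFF from text."""
    return ASTRAL.sub('', text)

content = '[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\U0001f633]'
stripped = strip_astral(content)

print(sys.maxunicode.bit_length())
print(len(content), len(stripped))
```

As noted above, this range is broad: it also removes non-emoji characters outside the BMP, such as rare CJK ideographs, so narrow it if those matter.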

How to find non-ascii characters in file using Regular Expression Python

I have a string of characters that includes [a-z] as well as á,ü,ó,ñ,å,... and so on. Currently I am using regular expressions to get every line in a file that includes these characters.
Sample of spanishList.txt:
adan
celular
tomás
justo
tom
átomo
camara
rosa
avion
Python code (charactersToSearch comes from flask @application.route('/<charactersToSearch>')):
print (charactersToSearch)
#'átdsmjfnueó'
...
#encode
charactersToSearch = charactersToSearch.encode('utf-8')
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word))
...
When I do this, I am expecting to get the words in the text file that include the characters in charactersToSearch. It works perfectly for words without special characters:
...
#after doing further searching for other conditions, return list of found words.
return '<br />'.join(sorted(set(word for (word, path) in solve())))
>>> adan
>>> justo
>>> tom
The only problem is that it ignores every word in the file that contains non-ASCII characters. I should also be getting tomás and átomo.
I've tried encode, UTF-8, and ur'[...]', but I haven't been able to get it to work for all characters. The file and the program (# -*- coding: utf-8 -*-) are both in UTF-8 as well.

A different tack
I'm not sure how to fix it in your current workflow, so I'll suggest a different route.
This regex will match characters that are neither white-space characters nor letters in the extended ASCII range, such as A and é. In other words, if one of your words contains a weird character that is not part of this set, the regex will match.
(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S
Of course this will also match punctuation, but I'm assuming that we're only looking at words in an unpunctuated list. Otherwise, excluding punctuation is not too hard.
As I see it, your challenge is to define your set.
In Python, you can do something like:
if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
    pass  # Successful match
else:
    pass  # Match attempt failed

I feel your pain. Dealing with Unicode in Python 2.x is a headache.
The problem with that input is that Python sees "á" as the raw byte string '\xc3\xa1' instead of the unicode character u'\xe1'. So you're going to need to decode the input before passing the string to your regex.
To change a raw byte string to a unicode string:
char = "á"
## repr(char) shows the two UTF-8 bytes '\xc3\xa1',
## which is probably what the regex is not registering.
new_unicode_string = char.decode('utf-8')  # u'\xe1'
This is an extra step before the regex runs, but without it the pattern and the text are compared byte by byte rather than character by character.
Dunno though, not an expert.
I used something similar to this to isolate the words with special characters when I parsed Wiktionary, which was a wicked mess. As far as I can tell, you're going to have to comb through the file to clean it up anyway, so you may as well just:
for word in file:
    try:
        # In Python 2, encoding a byte string first decodes it as ASCII,
        # which raises UnicodeDecodeError for any non-ASCII byte.
        word.encode('UTF-8')
    except UnicodeDecodeError:
        your_list_of_special_char_words.append(word)
Hope this helped, and good luck!
On further research, I found this post:
Bytes in a unicode Python string
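In Python 3 the same isolation step needs no exception handling, because the words are already Unicode strings. A minimal sketch, assuming the word list from the question is held in memory (has_non_ascii is a hypothetical helper name):

```python
def has_non_ascii(word):
    """True if any character in word lies outside the ASCII range."""
    return any(ord(ch) > 127 for ch in word)

# Word list taken from the question's spanishList.txt sample.
words = ['adan', 'celular', 'tomás', 'justo', 'tom', 'átomo', 'camara', 'rosa', 'avion']
special_char_words = [w for w in words if has_non_ascii(w)]
print(special_char_words)
```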

I was able to figure out the issue. After getting the string from the Flask app route, encode it (otherwise you get an error), and then decode both charactersToSearch and each word in the file.
charactersToSearch = charactersToSearch.encode('utf-8')
Then decode it as UTF-8. If you leave out the previous line, you get an error.
UNIOnlyAlphabet = charactersToSearch.decode('UTF-8')
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match
Lastly, when reading the UTF-8 file and using query, don't forget to decode each word in the file.
words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))
That should do it. Now the results show regular and special characters.
justo
tomás
átomo
adan
tom
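For comparison, here is how the same filter might look in Python 3, where no encode/decode round trip is needed. This is a sketch under assumptions: make_query is a hypothetical helper, the letter set 'aádjmnostu' is invented for the example, and the words are held in a list rather than read from spanishList.txt (which could be opened with encoding='utf-8').

```python
import re

def make_query(letters):
    """Build a matcher for words of 2+ characters drawn only from letters."""
    # re.escape guards against characters such as ] or - in the input.
    return re.compile('[' + re.escape(letters) + ']{2,}').fullmatch

word_list = ['adan', 'celular', 'tomás', 'justo', 'tom', 'átomo', 'camara', 'rosa', 'avion']
query = make_query('aádjmnostu')
found = [w for w in word_list if query(w)]
print(found)
```

If the input and the file might use different Unicode normalization forms (precomposed á versus a plus a combining accent), normalizing both with unicodedata.normalize before matching avoids false negatives.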

Python replace with re-using unknown strings

I have an XML in which I'd like to rename one of the tag groups like this:
<string>ABC</string>
<string>unknown string</string>
should be
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
ABC is always the same, so that's no issue. However, "unknown string" is always different, but since I need this information extracted, I also want to keep the same string in the replacement.
Here's what I got so far:
import re
#open the xml file for reading:
file = open('path/file','r+')
#convert to string:
data = file.read()
file.write(re.sub("<string>ABC</string>(\s+)<string>(.*)</string>","<xyz>ABC</xyz>[\1]<xyz>[\2]</xyz>",data))
print (data)
file.close()
I tried to use capture groups, but didn't do it correctly. The string is replaced with weird symbols in my XML. Plus, it's printed twice. I have both the unchanged and the changed version in my XML, which I don't want.

The problem you're experiencing is not due to your regex pattern. The backslashes (\) in your non-raw strings escape the characters that follow them, which produces the weird symbols that you see.
>>> print "hello\1world"
helloworld
>>> print r"hello\1world"
hello\1world
Always use the raw string notation to define your re patterns.
>>> data = """
... <string>ABC</string>
... <string>unknown string</string>
... """
>>> print re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data)
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
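The question also mentions that both the old and the new text end up in the file. That is consistent with reading a file opened in 'r+' mode to the end and then writing, which appends. A minimal Python 3 sketch of an in-place rewrite follows; rename_in_place is a hypothetical helper and the temporary file only stands in for the real XML file.

```python
import os
import re
import tempfile

PATTERN = r'<string>ABC</string>(\s+)<string>(.*)</string>'
REPLACEMENT = r'<xyz>ABC</xyz>\1<xyz>\2</xyz>'

def rename_in_place(path):
    """Apply the substitution and overwrite the file with the result."""
    with open(path, 'r+', encoding='utf-8') as f:
        data = f.read()
        new_data = re.sub(PATTERN, REPLACEMENT, data)
        f.seek(0)       # go back to the start instead of appending
        f.write(new_data)
        f.truncate()    # drop leftover text if the new content is shorter
    return new_data

# Demonstration on a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix='.xml')
with os.fdopen(fd, 'w', encoding='utf-8') as f:
    f.write('<string>ABC</string>\n<string>unknown string</string>\n')
rename_in_place(demo_path)
with open(demo_path, encoding='utf-8') as f:
    result = f.read()
os.remove(demo_path)
print(result)
```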

Why are you including the content in your replacement operation? All you need to do is:
Replace <string> by <xyz>.
Replace </string> by </xyz>.
It would take two operations but the intent of your code would be clear and you don't need to know what unknown string is.
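As a sketch, the two operations could be plain str.replace calls. Note the assumption: this renames every <string> tag in the data, not only the pair that follows ABC.

```python
data = "<string>ABC</string>\n<string>unknown string</string>\n"

# Two independent replacements; the tag contents are never inspected.
renamed = data.replace("<string>", "<xyz>").replace("</string>", "</xyz>")
print(renamed)
```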

how to turn 'C:\Music\song.mp3' into r'C:\Music\song.mp3'

I have been making an MP3 player with Tkinter and the mp3play module.
Say I have this song to play: C:\Music\song.mp3
To play that song, I have to run this script:
import mp3play
music_file=r'C:\Music\song.mp3'
clip = mp3play.load(music_file)
clip.play()
Easy enough. My problem, though, is getting the "r" there.
I have tried:
import mp3play
import re
music_file="'C:\Music\song.mp3'"
music_file='r'+music_file
music_file=re.sub('"','',music_file)
print music_file
clip = mp3play.load(music_file)
clip.play()
which produces the output: r'C:\Music\song.mp3'
but it is a string, so it won't read the file.

The 'r' in front denotes a particular kind of string literal called a raw string. You can't get that by concatenating strings or by a regex substitution. The result is an ordinary string; the prefix only changes how backslashes in the literal are interpreted.
>>> s = r'something'
>>> s
'something'
When you write the path in the script, use the 'r' prefix. If you get the input via raw_input, the backslashes arrive as literal characters and nothing needs escaping. So, the question is: why are you trying to do that?

Try:
music_file='C:/Music/song.mp3'

In Python, the r prefix introduces a raw string. Outside of raw strings, backslash (\) characters are considered as escape characters and have to be escaped themselves (by doubling them).
Try a simple string instead:
music_file = 'C:\\Music\\song.mp3'

The r you are talking about has to be placed before a string definition, and tells python that the following string is "raw", meaning it will ignore backslash escapes (so it doesn't error on invalid backslashes in filenames, for example).
Why don't you just do it like in the first example? I don't see what you are trying to accomplish in the second example.

You can try music_file = r'%s' % path_to_file (although the r prefix changes nothing here, since '%s' contains no backslashes).

As a few of the other answers have pointed out (I'm just posting this as an answer because it seemed kind of silly to make it a comment), what you've given in your first code block is exactly what the contents of your script should be. You don't need to do anything special to get the r there. In fact the 'r' is not part of the string, it's part of the code that makes the string.
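A short sketch makes the point concrete: the raw literal and the doubled-backslash literal produce the same string, and a path assembled at runtime needs no prefix at all. The variable names are illustrative, and chr(92) (the backslash character) stands in for text that arrives from user input or a file dialog.

```python
raw = r'C:\Music\song.mp3'        # raw literal: backslashes kept as written
escaped = 'C:\\Music\\song.mp3'   # normal literal: each backslash doubled

# A path built at runtime; chr(92) is a single backslash.
typed = 'C:' + chr(92) + 'Music' + chr(92) + 'song.mp3'

print(raw == escaped == typed)
print(raw.count(chr(92)))
```

All three names refer to equal strings, so any of them could be passed to mp3play.load as in the first script.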
