How to include non-ASCII characters in a regular expression in Python

I have a text file which I'm reading line by line. In each line, if there are special characters, I remove them; for this I'm using regular expressions.
import re

fh = open(r"abc.txt", "r+")
data = fh.read()
#print re.sub(r'\W+', '', data)
new_str = re.sub('[^a-zA-Z0-9\n\.;,?!$]', ' ', data)
So here in my data I'm keeping only the alphanumeric characters along with a few special symbols, namely [.;,?!$], but I also want to keep the Euro (€), pound (£), Japanese yen (¥) and rupee (₹) signs. These are not ASCII characters, so when I include them in my regular expression, like re.sub('[^a-zA-Z0-9\n.;,?!$€₹¥]', ' ', data), it gives an error message:
SyntaxError: Non-ASCII character '\xe2' in file preprocess.py on line 23, but no encoding declared

You can make use of Unicode character escapes. For example, the Euro character above can be written as \u20ac. The four hex digits are the character's Unicode code point, regardless of the byte encoding. In an example regex, this might look like:
[^a-zA-Z0-9\u20ac]
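Applied to the asker's code, a minimal sketch (assuming Python 2 and that data has first been decoded to unicode) might be:
# \u20ac = €, \u00a3 = £, \u00a5 = ¥, \u20b9 = ₹. The escapes are only interpreted
# inside unicode string literals (the u prefix), so the pattern itself stays pure ASCII.
import re

data = open("abc.txt").read().decode("utf-8")
new_str = re.sub(u'[^a-zA-Z0-9\n.;,?!$\u20ac\u00a3\u00a5\u20b9]', u' ', data)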

Maybe not the solution, but potentially a partial solution. Use this as the first two lines of each of your Python 2 files:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
This tells Python 2 that the source file is encoded as UTF-8, so non-ASCII characters are allowed in string literals. In Python 3, UTF-8 is the default source encoding.
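With the declaration in place (and the file actually saved as UTF-8), the currency symbols can be written directly in a unicode pattern; a minimal sketch, assuming data has been decoded from UTF-8:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# The symbols can appear literally in the pattern because the source encoding is declared.
data = open("abc.txt").read().decode("utf-8")
new_str = re.sub(u'[^a-zA-Z0-9\n.;,?!$€£¥₹]', u' ', data)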

Related

Python 2 regex operations from re package not handling utf-8 diacritic encoding

I'm performing a series of regex operations on a utf-8 encoded text file that contains a list of lines containing alphabetic and non-alphabetic characters, including non-Latin characters with diacritics. This is a snippet from the file (notice the non-Latin characters):
oro[=]sia[=]łeś
oszust[=]ką
My script first opens the text file, reads each line and strips the unnecessary characters. My regex operations then first catch a word matching a specified pattern and then either insert or adjust the position of the non-alphabetical character groups [=]. This is a snippet from my script:
# -*- coding: utf-8 -*-
import re

with open(r'...\input.txt', "rb") as input, open(r'...\output.txt', "wb") as output:
    for line in input:
        word = line.strip('\r\n')
        # Rule 1: ^VCV -> V[=]CV
        match = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy].*(.*\[=\].*)*', word)
        result = match.group() if match else None
        if result == word:
            word = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', word)
        outLine = word + "\n"
        errorList.write(outLine)
The rule seems to fail with inputs with rule environments that involve non-Latin characters with diacritics. For example, when the input to Rule 1 above is 'oszust[=]ką', re.match.group() re-encodes it as 'oszust[=]k\xc4'. Converting the last character changes the environment and matches the input for the following regex operation.
The problem clearly lies in utf-8 encoding, because the script manages to process oro[=]sia[=]łeś, where the rule environment does not contain characters with diacritics, just fine. Having read this website, I tried re-encoding the input to utf-8 so that it meets the environment for the regex operation, but instead I get this error:
'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)
Why does the error mention ascii if I'm trying to encode it as utf-8? How can I modify the encoding so that it meets the environment required for the regex operation?
When dealing with Unicode characters, use Unicode strings. Convert to/from Unicode strings at the I/O boundaries of your program. Switch to the latest Python 3 if possible. It handles Unicode much better.
# -*- coding: utf-8 -*-
import re
import io

with io.open('input.txt', 'r', encoding='utf8') as input, \
     io.open('output.txt', 'w', encoding='utf8') as output:
    for line in input:
        word = line.strip()  # this will remove all leading/trailing whitespace
        # Rule 1: ^VCV -> V[=]CV
        match = re.match(u'^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy].*(.*\[=\].*)*', word)
        result = match.group() if match else None
        if result == word:
            word = re.sub(u'(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', u'[=]', word)
        outLine = word + u'\n'
        output.write(outLine)
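As to why the error message mentions ascii: in Python 2, calling .encode() on an already-encoded byte string first implicitly decodes it with the default ASCII codec, and that implicit step is what most likely failed here. A small sketch of that behaviour:
# The byte string below is the UTF-8 encoding of u'ką'; calling .encode('utf-8') on it
# triggers an implicit ASCII decode first, which raises the UnicodeDecodeError.
word = 'k\xc4\x85'
word.encode('utf-8')   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 ...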

Running Python 2.7 Code With Unicode Characters in Source

I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
    print 'naïve'
    print '男孩'

fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into its equivalent that
follows it? That is, only unicode sequences should be written in
escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound, but it will break in Python 3 and will cause a headache when you are manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
note the u prefix
Your code is now encoding agnostic.
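For instance, a file like the following needs no coding declaration at all, because the escaped source is pure ASCII (a minimal sketch, assuming a Python 2 interpreter with a UTF-8 terminal):
# Printing 'naive' with a diaeresis and two Chinese characters, without any
# non-ASCII bytes in the source file itself.
def fxn():
    print u'na\xefve'
    print u'\u7537\u5b69'

fxn()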
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. No need for the coding statement (although it is REALLY strange that you don't want to use it... it's just a comment). The coding statement lets Python know the encoding of the source file, so it can decode Unicode string literals (u'xxxxx') correctly. If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in string literals and constants in Python 2. Python 3, however, allows non-ASCII in variable names so this wouldn't work there.
import io
import re

def escape(m):
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)

with io.open('sample.py', encoding='utf8') as f:
    content = f.read()

new_content = re.sub(r'[^\x00-\x7f]', escape, content)

with io.open('sample_new.py', 'w', encoding='utf8') as f:
    f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
Question 1:
Try using unicode literals:
print u'naïve'
print u'长者'
Question 2:
If you type the sentences with your keyboard and a Chinese input method, everything should be OK. But if you copy and paste sentences from some web pages, you should consider other encodings such as GBK, GB2312 and GB18030.
This snippet of Python 3 should convert your program correctly to work in Python 2.
def convertchar(char):  # converts an individual character
    if 32 <= ord(char) <= 126 or char == "\n":
        return char  # printable ASCII (or newline): keep it as-is
    h = hex(ord(char))[2:]
    if ord(char) < 256:  # unprintable ASCII / Latin-1 range
        return "\\x" + h.zfill(2)
    elif ord(char) < 65536:  # BMP character: short \u escape
        return "\\u" + h.zfill(4)
    else:  # astral character: long \U escape
        return "\\U" + h.zfill(8)

def converttext(text):  # converts a chunk of text
    newtext = ""
    for char in text:
        newtext += convertchar(char)
    return newtext

def convertfile(oldfilename, newfilename):  # converts a file
    oldfile = open(oldfilename, "r")
    oldtext = oldfile.read()
    oldfile.close()
    newtext = converttext(oldtext)
    newfile = open(newfilename, "w")
    newfile.write(newtext)
    newfile.close()

convertfile("FILE_TO_BE_CONVERTED", "FILE_TO_STORE_OUTPUT")
First, a simple remark: as you are using byte strings in a Python 2 script, the # -*- coding: utf-8 -*- declaration has essentially no effect. It would only matter for converting the source bytes to a unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve'  # the source code is the byte string 'na\xc3\xafve',
                 # but utxt must become the unicode string u'na\xefve'
It may also be picked up by clever editors so that they automatically save the file with a UTF-8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: identifying what is in a comment and what is in a string in a source file requires a Python parser... And AFAIK, if you use the parser from the ast module, you will lose your comments except for docstrings.
But in Python 2, non-ASCII characters are only allowed in comments and literal strings! So you can assume that if the source file is a correct Python 2 script containing no literal unicode strings(*), you can safely transform any non-ASCII character into its escaped representation.
A possible Python function that reads a raw source file from a file object and writes the escaped version to another file object could be:
def src_encode(infile, outfile):
    while True:
        c = infile.read(1)
        if len(c) < 1:
            break  # stop at end of file
        if ord(c) > 127:  # transform high (non-ASCII) bytes
            c = "\\x{:02x}".format(ord(c))
        outfile.write(c)
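A possible way to drive it, assuming Python 2 (the file names below are just placeholders):
# Read the original source as raw bytes and write the escaped copy.
with open('sample.py', 'rb') as fin, open('sample_ascii.py', 'wb') as fout:
    src_encode(fin, fout)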
A nice property is that it works whatever encoding you use, provided the source file is acceptable to a Python interpreter and contains no high characters in unicode literals(*); the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode literals in an encoding other than Latin-1, because the above function behaves as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly to u'\xe9' if the original encoding is Latin-1, but to u'\xc3\xa9' (not what is expected...) if the original encoding is UTF-8, and I cannot imagine a way to correctly process both literal byte strings and unicode literals without fully parsing the source file...

How to read Chinese files?

I'm stuck with all this confusing encoding stuff. I have a file containing Chinese subtitles. I actually believe it is UTF-8, because opening it as UTF-8 in Notepad++ gives me a very good result. If I set gb2312, the Chinese part is still fine, but I see some UTF-8 sequences that are not converted.
The goal is to loop through the text in the file and count how many times the different chars come up.
import os
import re
import io
import codecs

character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if "srt" in filename:
            f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
            s = f.read()
            # deleting {}
            s = re.sub('{[^}]+}', '', s)
            # deleting every line that does not start with a chinese char
            s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
            # delete non chinese chars
            s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
            #print s
            s = s.encode('gb2312')
            print s
            for c in s:
                #print c
                pass
This will actually give me the complete Chinese text. But when I print inside the loop at the bottom, I just get question marks instead of the individual characters.
Also note that I said it is UTF-8, but I have to use gb2312 for the encoding and as the setting in my gnome-terminal. If I set it to UTF-8 in the code, I just get garbage, no matter whether I set my terminal to UTF-8 or gb2312. So maybe this file is not UTF-8 after all!?
In any case s contains the full Chinese text. Why can't I loop it?
Please help me to understand this. It is very confusing for me and the docs are getting me nowhere. And google just leads me to similar problems that somebody solves, but there is no explanation so far that helped me understand this.
gb2312 is a multi-byte encoding. If you iterate over a bytestring encoded with it, you will be iterating over the bytes, not over the characters you want to be counting (or printing). You probably want to do your iteration on the unicode string before encoding it. If necessary, you can encode the individual codepoints (characters) to their own bytestrings for output:
# don't do s = s.encode('gb2312')
for c in s:  # iterate over the unicode codepoints
    print c.encode('gb2312')  # encode them individually for output, if necessary
You are printing individual bytes. GB2312 is a multi-byte encoding, and each codepoint uses 2 bytes. Printing those bytes individually won't produce valid output, no.
The solution is to not encode from Unicode to bytes when printing. Loop over the Unicode string instead:
# deleting {}
s = re.sub('{[^}]+}', '', s)
# deleting every line that does not start with a chinese char
s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
# delete non chinese chars
s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
#print s
# No `s.encode()` here!
for char in s:
    print char
You could also still encode each character individually:
for char in s:
    print char.encode('gb2312')
But if you have your console / IDE / terminal correctly configured, you should be able to print the unicode directly without errors, especially since your print s.encode('gb2312') produces correct output.
You also appear to be confusing UTF-8 (an encoding) with the Unicode standard. UTF-8 can be used to represent Unicode data in bytes. GB2312 is an encoding too, and can be used to represent a (subset of) Unicode text in bytes.
You may want to read up on Python and Unicode:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

How to cope with diacritics while trying to match with regex in Python

Trying to use regular expression with unicode html escapes for diacritics:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''
print re.findall( r'/">(.*?)</a', htmlstring, re.U )
produces:
[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
Any help, please?
This appears to be an encoding question. Your code is working as it should; were you expecting something different? The strings prefixed with u are unicode literals. Escapes that begin with \u give a Unicode character as four hex digits, whereas escapes that begin with \x use only two hex digits. If you print your results (instead of looking at their __repr__), you will see that you have received the result you were looking for:
results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
    print result

čćđš
España
In your code (i.e. in your list), you see the representation of these unicode literals:
for result in results:
    print result.__repr__()

u'\u010d\u0107\u0111\u0161'  # what shows up in your list
u'Espa\xf1a'
Incidentally, it appears that you are trying to parse html with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.
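For example, a minimal sketch with BeautifulSoup 4 (assuming it is installed and the anchors are well-formed):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = u'<a href="/">čćđš</a> ... <a href="/">España</a>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get_text()   # čćđš, then España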

How to make the python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like so:
6 918 417 712
The clear-cut way to trim this string (as I understand Python) is simple: with the string in a variable called s, we get:
s.replace('Â ', '')
That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.
I never quite could understand how to switch between different encodings.
Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:
#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
The code:
f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
#making a print 's' here goes well. it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)
It gets no further than s.replace...
Throw out all characters that can't be interpreted as ASCII:
def remove_non_ascii(s):
    return "".join(c for c in s if ord(c) < 128)
Keep in mind that this is guaranteed to work with the UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
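A quick illustration, using a UTF-8-encoded version of the string from the question:
# \xc2\xa0 is the UTF-8 encoding of NO-BREAK SPACE; both bytes are >= 0x80,
# so the whole character is dropped and no partial sequence is left behind.
print remove_non_ascii('6\xc2\xa0918\xc2\xa0417\xc2\xa0712')   # -> 6918417712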
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). But in Python 3, just use quotes. In Python 2, you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well
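For example (a small sketch; the file must be saved as UTF-8 with the coding declaration for the literal below to work):
# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â ", u"")   # rebind s: replace() returns a new string
print s                     # 6918417712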
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
The following code will replace all non ASCII characters with question marks.
"".join([x if ord(x) < 128 else '?' for x in s])
Using Regex:
import re
strip_unicode = re.compile("([^-_a-zA-Z0-9!##%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
Way too late for an answer, but the original string was in UTF-8 and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8') (\xa0 displays as a space when the bytes are decoded incorrectly as Windows-1252 or Latin-1):
Example (Python 3)
s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE
Output
6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "")
print s
This will print out 6 918 417 712
I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).
Usage : str.translate(table[, deletechars])
>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )
>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6 918 417 712'
Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.
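For example, with byte strings the deletechars form looks like this (Python 2.6+; a small illustration):
# table=None means "no translation"; deletechars removes every high byte.
high_bytes = ''.join(chr(i) for i in range(128, 256))
print '6\xc2\xa0918\xc2\xa0417\xc2\xa0712'.translate(None, high_bytes)   # -> 6918417712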
With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely : unicode_string.encode("ascii", "ignore")
As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:
trans_table = ''.join([chr(i) for i in range(128)] + ['?'] * 128)

def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)
The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
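A small sketch of that idea with a unicode string (the translation table is a dict keyed by ord(), as noted above; the mapping below is just a hand-picked example):
# -*- coding: utf-8 -*-
# Map a few accented characters to their plain ASCII counterparts.
accent_map = {ord(u'é'): u'e', ord(u'è'): u'e', ord(u'à'): u'a'}
print u'Résultat'.translate(accent_map)   # Resultat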
s.replace(u'Â ', '') # u before string is important
and save your .py file as UTF-8 (with the coding declaration).
This is a dirty hack, but may work.
s2 = ""
for i in s:
if ord(i) < 128:
s2 += i
For what it was worth, my character set was utf-8 and I had included the classic "# -*- coding: utf-8 -*-" line.
However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.
My text had two words, separated by "\r\n". I was only splitting on "\n" and replacing the "\n", so the stray "\r" remained.
Once I looped through and saw the character set in question, I realized the mistake.
So, it could also be within the ASCII character set, but a character that you didn't expect.
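A small illustration of that gotcha (hypothetical data):
line = "6 918\r\n417 712\r\n"
print line.split("\n")    # ['6 918\r', '417 712\r', ''] -- stray '\r' left behind
print line.splitlines()   # ['6 918', '417 712'] -- handles \r\n as well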
my 2 pennies with beautiful soup,
string = '<span style="width: 0px> dirty text begin ( ĀĒēāæśḍṣ <0xa0> ) dtext end </span></span>'
string = string.encode().decode('ascii', errors='ignore')
print(string)
will give
<span style="width: 0px> dirty text begin ( ) dtext end </span></span>
