Python 2.5 sub function from regex module not recognizing a pattern - python

I'm trying to use Python's sub function from the regex module to recognize and change a pattern in a string. Below is my code.
old_string = "afdëhë:dfp"
newString = re.sub(ur'([aeiouäëöüáéíóúàèìò]|ù:|e:|i:|o:|u:|ä:|ë:|ö:|ü:|á:|é:|í:|ó:|ú:|à:|è:|ì:|ò:|ù:)h([aeiouäëöüáéíóúàèìòù])', ur'\1\2', old_string)
So what I'm looking to get after the code is applied is afdëë:dfp (without the h). I'm trying to match a vowel (sometimes with an accent, sometimes with a colon after it), then the h, then another vowel (sometimes with an accent). A few examples...
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë
So I'm trying to remove the h when it is between two vowels, and also remove the h when it follows a vowel with a colon after it and precedes another vowel (i.e. a:ha). Any help is greatly appreciated. I've been playing around with this for a while.

A single user-perceived character may consist of multiple Unicode codepoints (for example, a base letter followed by a combining accent). Such characters can break a u'[abc]'-style regex, because in Python it matches individual codepoints only. To work around it, you could use a u'(?:a|b|c)' regex instead. In addition, don't mix bytes and Unicode strings, i.e., old_string should also be Unicode.
Applying the last rule fixes your example.
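For instance, a minimal sketch of that fix (assuming the source file is saved as UTF-8; the character class below is a compacted, hypothetical equivalent of your pattern):
# -*- coding: utf-8 -*-
import re

old_string = u"afdëhë:dfp"   # note the u prefix: a Unicode literal
new_string = re.sub(ur'([aeiouäëöüáéíóúàèìòù]:?)h([aeiouäëöüáéíóúàèìòù])',
                    ur'\1\2', old_string)
print(new_string)   # afdëë:dfp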
You could write your regex using lookahead/lookbehind assertions:
# -*- coding: utf-8 -*-
import re
from functools import partial
old_string = u"""
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë"""
# (?<=a|b|c)(:?)h(?=a|b|c)
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels
print(remove_h(old_string))
Output
ò:a becomes ò:a
ä:à becomes ä:à
aa becomes aa
üa becomes üa
ëë becomes ëë
For completeness, you could also normalize all Unicode strings in the program using the unicodedata.normalize() function (see the example in the docs to understand why you might need it).
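As a small illustration of why normalization matters (a sketch; the decomposed input below is an assumption about the kind of text you might receive):
# -*- coding: utf-8 -*-
import unicodedata

s = u'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT: two codepoints, one visible character
print(len(s))                                 # 2
print(len(unicodedata.normalize('NFC', s)))   # 1 -- the single precomposed character é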

It is an encoding issue. Different combinations of the file encoding and old_string being non-unicode behave differently on different Pythons.
For example, your code works fine on Python 2.6 and 2.7 this way (all data below is cp1252-encoded):
# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"
but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is specified in the file.
However, those lines fail on Python 2.5 with UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128).
Meanwhile, on every Python version the h is not removed when old_string is a non-unicode (byte) string:
# -*- coding: utf8 -*-
old_string = "afdëhë:dfp"
So you have to declare the correct source encoding and define old_string as a unicode string as well; for example, this will do:
# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"

Related

How to include Non-ascii characters in regular expression in Python

I have a text file which I'm reading line by line. If a line contains special characters, I remove them; for this I'm using regular expressions.
fh = open(r"abc.txt","r+")
data = fh.read()
#print re.sub(r'\W+', '', data)
new_str = re.sub('[^a-zA-Z0-9\n\.;,?!$]', ' ', data)
So here in my data I'm keeping only the alphanumeric words along with a few special symbols, which are [.;,?!$], but along with them I also want the Euro symbol (€), pound (£), Japanese yen (¥) and Rupee symbol (₹). These are not ASCII characters, so when I include them in my regular expression, like re.sub('[^a-zA-Z0-9\n.;,?!$€₹¥]', ' ', data), it gives an error:
SyntaxError: Non-ASCII character '\xe2' in file preprocess.py on line 23, but no encoding declared
You can make use of Unicode character escapes. For example, the Euro character above can be represented as \u20ac. The four hex digits are the character's Unicode code point, irrespective of the file encoding. In an example regex, this might look like:
[^a-zA-Z0-9\u20ac]
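For instance, a minimal sketch using escapes for all four symbols (the sample data string is made up; £ is \u00a3, ¥ is \u00a5, ₹ is \u20b9):
# -*- coding: utf-8 -*-
import re

data = u"Total: 100€ / £80 / ¥9000 / ₹500 #junk#"   # hypothetical sample line
# keep alphanumerics, the listed punctuation and the four currency symbols
cleaned = re.sub(u'[^a-zA-Z0-9\n.;,?!$\u20ac\u00a3\u00a5\u20b9]', u' ', data)
print(cleaned)   # the currency symbols survive; ':', '/' and '#' become spaces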
Maybe not the solution, but potentially a partial solution. Use this as the first two lines of each of your Python 2 files:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
This declares the source file's encoding as UTF-8, so Python 2 accepts non-ASCII characters in the source. In Python 3, UTF-8 is already the default source encoding.

How to cope with diacritics while trying to match with regex in Python

Trying to use regular expression with unicode html escapes for diacritics:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''
print re.findall( r'/">(.*?)</a', htmlstring, re.U )
produces :
[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
Any help, please?
This appears to be an encoding question. Your code is working as it should; were you expecting something different? The strings prefixed with u are unicode literals. Escapes that begin with \u are followed by four hex digits, whereas escapes that begin with \x are followed by only two. If you print your results (instead of looking at their repr), you will see that you have received the result you were looking for:
results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
    print result
čćđš
España
In your code (i.e. in your list), you see the representation of these unicode literals:
for result in results:
    print repr(result)
u'\u010d\u0107\u0111\u0161' # what shows up in your list
u'Espa\xf1a'
Incidentally, it appears that you are trying to parse html with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.
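If you do switch, a minimal sketch with bs4 (assuming the fragment is wrapped in well-formed <a> tags; the href values are placeholders):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

htmlstring = u'<a href="/">čćđš</a> ... <a href="/">España</a>'
soup = BeautifulSoup(htmlstring, 'html.parser')
for a in soup.find_all('a'):
    print a.get_text()   # čćđš, then España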

Finding regex, unicode patterns

I'm trying to scrape a website where Unicode characters are present. I put -*- coding: utf-8 -*- at the very beginning, plus I used the re.UNICODE flag:
pattern = re.compile('(?:{}|{})'.format(regex, regex1), re.UNICODE)
However, when I print the output I still get weird characters like �.
How do I fix that? Thanks!
Just because a page has non-Latin characters doesn't mean it's encoded the way you assume (also, which Unicode encoding: UTF-8? UTF-16?).
Additionally, re.UNICODE probably doesn't do what you think it does. From the docs:
Make `\w, \W, \b, \B, \d, \D, \s` and `\S` dependent on the Unicode character properties database.
All this means is that those specific character classes are defined more broadly; it has no effect on how the source text is decoded.
Moreover, the coding declaration -*- coding: utf-8 -*- only specifies the encoding of your source file, not of the data you fetch.
Finally, as noted in one of the comments, the � can be the result of using a character which is not supported by the current typeface. This, in turn, can be the result of assuming a certain encoding while the text is encoded in a different encoding.
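As a hedged sketch of that last point in Python 2 (the URL is a placeholder, and the fallback assumes UTF-8 when the server declares no charset):
import urllib2

response = urllib2.urlopen('http://example.com/page')        # placeholder URL
raw_bytes = response.read()
charset = response.headers.getparam('charset') or 'utf-8'    # use the declared encoding if any
text = raw_bytes.decode(charset)   # a unicode string; no stray characters from a wrong guess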
This may not be an "answer" per se, but you could try using http://www.debuggex.com to debug your regex a bit.

How to get ° character in a string in python?

How can I get a ° (degree) character into a string?
This is the most coder-friendly version of specifying a Unicode character:
degree_sign = u'\N{DEGREE SIGN}'
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database
Reference: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
Note:
"N" must be uppercase in the \N construct to avoid confusion with the \n newline character
The character name inside the curly braces can be any case
It's easier to remember the name of a character than its Unicode index. It's also more readable, ergo debugging-friendly. The character substitution happens at compile time, i.e. the .py[co] file will contain a constant for u'°':
>>> import dis
>>> c= compile('u"\N{DEGREE SIGN}"', '', 'eval')
>>> dis.dis(c)
1 0 LOAD_CONST 0 (u'\xb0')
3 RETURN_VALUE
>>> c.co_consts
(u'\xb0',)
>>> c= compile('u"\N{DEGREE SIGN}-\N{EMPTY SET}"', '', 'eval')
>>> c.co_consts
(u'\xb0-\u2205',)
>>> print c.co_consts[0]
°-∅
>>> u"\u00b0"
u'\xb0'
>>> print _
°
BTW, all I did was search "unicode degree" on Google. This brings up two results:
"Degree sign U+00B0" and "Degree Celsius U+2103", which are actually different:
>>> u"\u2103"
u'\u2103'
>>> print _
℃
Put this line at the top of your source
# -*- coding: utf-8 -*-
If your editor saves files in a different encoding, substitute that encoding for utf-8
Then you can include utf-8 characters directly in the source
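For instance, a minimal sketch (the file itself must be saved as UTF-8 by your editor):
# -*- coding: utf-8 -*-
temperature = u"23 °C"   # the ° is typed directly into the UTF-8 source
print temperature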
You can also use chr(176) to get the degree sign in Python 3, since 176 (0xB0) is its Unicode code point.
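A minimal sketch of what that looks like in a Python 3.6 interactive shell:
>>> degree_sign = chr(176)
>>> degree_sign
'°'
>>> print(degree_sign + "C")
°C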
Just use '\xb0' in a (unicode) string; Python will render it as the degree sign.
The above answers assume that UTF-8 encoding can safely be used; this one specifically targets Windows.
The Windows console normally uses CP850 encoding, not UTF-8, so if you try to use a UTF-8-encoded source file, you get the two (incorrect) characters ┬░ instead of a degree sign °.
Demonstration (using python 2.7 in a windows console):
deg = u'\xb0'  # the Unicode code point for the degree sign
print deg.encode('utf8')
effectively outputs ┬░.
Fix: just force the correct encoding (or better use unicode):
local_encoding = 'cp850' # adapt for other encodings
deg = u'\xb0'.encode(local_encoding)
print deg
or, if you use a source file that explicitly declares an encoding:
# -*- coding: utf-8 -*-
local_encoding = 'cp850' # adapt for other encodings
print " The current temperature in the country/city you've entered is " + temp_in_county_or_city + "°C.".decode('utf8').encode(local_encoding)
Using a Python f-string, f"{var}", you can write:
theta = 45
print(f"Theta {theta}\N{DEGREE SIGN}.")
Output: Theta 45°.
Improving on tzot's answer:
Special case: Sending a string to a 16x2 or 20x4 LCD (e.g. attached to a Raspberry Pi with I2C LCD display):
f"Outside Temp: {my_temp_variable}{chr(223)}"

How to make the python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like so:
6 918 417 712
The clear-cut way to clean up this string (as I understand Python), assuming the string is in a variable called s, is:
s.replace('Â ', '')
That should do the trick. But of course it complains about the non-ASCII character '\xc2' in file blabla.py because no encoding is declared.
I never quite could understand how to switch between different encodings.
Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:
#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
The code:
f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
#making a print 's' here goes well. it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)
It gets no further than s.replace...
Throw out all characters that can't be interpreted as ASCII:
def remove_non_ascii(s):
    return "".join(c for c in s if ord(c) < 128)
Keep in mind that this is guaranteed to work with the UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""); in Python 3, just use plain quotes. In Python 2 you can also from __future__ import unicode_literals to get the Python 3 behaviour (see the sketch after these notes), but be aware this affects the entire module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well
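A short sketch of those points combined (assuming the file is saved as UTF-8; the sample string is made up):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals   # plain quotes now produce unicode literals in Python 2

s = "6Â 918Â 417Â 712"      # a unicode string thanks to the future import
s = s.replace("Â ", "")      # replace returns a new string, so keep the result
print s                      # 6918417712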
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
The following code will replace all non ASCII characters with question marks.
"".join([x if ord(x) < 128 else '?' for x in s])
Using Regex:
import re
strip_unicode = re.compile("([^-_a-zA-Z0-9!##%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is the UTF-8 encoding of NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'); the Â appears because \xc2 shows up as Â and \xa0 as a space when the bytes are incorrectly decoded as Windows-1252 or Latin-1:
Example (Python 3)
s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE
Output
6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "")
print s
This will print out 6 918 417 712
I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).
Usage : str.translate(table[, deletechars])
>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )
>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6 918 417 712'
Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.
With unicode strings, the translation table is not a 256-character string but a dict with the ord() of the relevant characters as keys. But anyway, getting a proper ASCII string from a unicode string is simple enough, using the method mentioned by truppo above, namely: unicode_string.encode("ascii", "ignore")
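A tiny sketch of the unicode form (here mapping é to a plain e and deleting Â; both mappings are made up for illustration):
>>> u'Résultat'.translate({ord(u'é'): u'e'})
u'Resultat'
>>> u'6Â 918'.translate({ord(u'Â'): None})
u'6 918'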
As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:
trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)
The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
s.replace(u'Â ', '') # u before string is important
and save your .py file as UTF-8, with a matching coding declaration.
This is a dirty hack, but may work.
s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i
For what it was worth, my character set was utf-8 and I had included the classic "# -*- coding: utf-8 -*-" line.
However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.
My text had two words separated by "\r\n". I was only splitting on "\n" and then replacing "\n", so the stray "\r" remained.
Once I looped through and saw the character set in question, I realized the mistake.
So, it could also be within the ASCII character set, but a character that you didn't expect.
My 2 pennies, with Beautiful Soup:
string = '<span style="width: 0px> dirty text begin ( ĀĒēāæśḍṣ <0xa0> ) dtext end </span></span>'
string = string.encode().decode('ascii', errors='ignore')
print(string)
will give
<span style="width: 0px> dirty text begin ( ) dtext end </span></span>
