Unicode issues when using NLTK - python

I have some text scraped from the internet (I think it was Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text looks something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words, e.g. replacing the € symbol (\u20ac) with "euros" using the regex (r'\u20ac', r' euros').
Here my problem seems to start... If I do not encode each string to "UTF-8" before applying the regex, the regex won't find any occurrences (even though plenty of occurrences exist)...
Anyways, after encoding it to UTF-8, the regex (r'\u20ac', r' euros') works.
After that I tokenize and tag all the strings. When I try to use the RegexpParser I then get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is, if I have already encoded it to UTF-8, how come I have a problem now? And what would be your suggestion to try to avoid it?
Is there a way to do the encoding process once and for all, like below? If so, what should I do for the second part (encode/decode it anyway)?
Get text -> encode/decode it anyway... -> Work on the text without any issue
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
regex function
import re

replacement_patterns = [(ur' \\u20ac', ur' euros'),
                        (ur' \xe2\x82\xac', r' euros'),
                        (ur' \b[eE]?[uU]?[rR]\b', r' euros'),
                        (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b', ur' \1 euros')]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s

You seem to be misunderstanding the meaning of r'\u20ac'
The r indicates a raw string. Not a unicode string, a standard one. So using a unicode escape in a pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
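To address the broader "once and for all" question: decode the scraped bytes to unicode once at the boundary, do every regex and NLTK step on unicode objects (with ur'...' patterns), and encode back to UTF-8 only when writing output. A minimal Python 2 sketch of that flow, with illustrative data:
# -*- coding: utf-8 -*-
import re

raw = '730\xe2\x82\xac ropa nueva 2012'     # UTF-8 bytes as scraped
text = raw.decode('utf-8')                  # 1. decode once -> unicode
text = re.sub(ur'\u20ac', u' euros', text)  # 2. work entirely in unicode
print text.encode('utf-8')                  # 3. encode only for output
# 730 euros ropa nueva 2012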

Did you use the decode & encode functions correctly?
from nltk import ne_chunk, pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

punkt_tokenizer = PunktSentenceTokenizer()
treebank_tokenizer = TreebankWordTokenizer()

text = "€"
text = text.decode('utf-8')  # byte string -> unicode (Python 2)
sentences = punkt_tokenizer.tokenize(text)
tokens = [treebank_tokenizer.tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
I have no problems currently. The only issue is that $50 says:
word: $ meaning: dollar
word: 50 meaning: numeral, cardinal
This is correct.
And €50 says:
word: €50 meaning: -NONE-
This is INcorrect.
With a space between the € sign and the number, it says:
word: € meaning: noun, common, singular or mass
word: 50 meaning: numeral, cardinal
Which is more correct.
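If the goal is to get €50 tagged like $ 50, one workaround is to split the sign from the digits (or run the euros replacement from the question) before tokenizing. A hypothetical Python 2 helper (the function name is illustrative):
# -*- coding: utf-8 -*-
import re

def separate_currency(text):
    # Put a space between a euro sign and an adjacent digit
    return re.sub(ur'(\u20ac)(\d)', ur'\1 \2', text)

print separate_currency(u'\u20ac50').encode('utf-8')  # € 50, which then tags like '$ 50'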

Related

Character classes using byte regex for characters encoded with multiple bytes

I would like to use regular expressions on bytestrings in Python of which I know the encoding (UTF-8). I am facing difficulties trying to use character classes that involve characters that are encoded using more than one byte. They appear to become two or more 'characters' that are matched separately in the character class.
Performing the search on (unicode) strings instead is possible, but I would like to know if there is a solution to defining character classes for the case of bytestrings as well. Maybe it's just not possible!?
Below is a python 3 example that shows what happens when I try to replace different line breaks with '\n':
import re

def show_pattern(pattern):
    print(f"\nPattern repr:\t{repr(pattern)}")

def test_sub(pattern, replacement, text):
    print(f"Before repr:\t{repr(text)}")
    result = re.sub(pattern, replacement, text)
    print(f"After repr:\t{repr(result)}")

# Pattern for line breaks
PATTERN = '[' + "\u000A\u000B\u000C\u000D\u0085\u2028\u2029" + ']'
REPLACEMENT = '\n'
TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?"

show_pattern(PATTERN)
test_sub(PATTERN, REPLACEMENT, TEXT)
# expected output:
# Pattern repr: '[\n\x0b\x0c\r\x85\u2028\u2029]'
# Before repr: 'How should I replace my unicode string\u2028using utf-8-encoded bytes?'
# After repr: 'How should I replace my unicode string\nusing utf-8-encoded bytes?'
ENCODED_PATTERN = PATTERN.encode('utf-8')
ENCODED_REPLACEMENT = REPLACEMENT.encode('utf-8')
ENCODED_TEXT = TEXT.encode('utf-8')
show_pattern(ENCODED_PATTERN)
test_sub(ENCODED_PATTERN, ENCODED_REPLACEMENT, ENCODED_TEXT)
# expected output:
# Pattern repr: b'[\n\x0b\x0c\r\xc2\x85\xe2\x80\xa8\xe2\x80\xa9]'
# Before repr: b'How should I replace my unicode string\xe2\x80\xa8using utf-8-encoded bytes?'
# After repr: b'How should I replace my unicode string\n\n\nusing utf-8-encoded bytes?'
In the encoded version, I end up with three '\n's instead of one. Similar things happen for a more complicated document where it's not obvious what the correct output should be.
You may use an alternation-based pattern rather than a character class, as you will want to match sequences of bytes:
PATTERN = "|".join(['\u000A','\u000B','\u000C','\u000D','\u0085','\u2028','\u2029'])
If you prefer to initialize the pattern from a string use
CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"
PATTERN = "|".join(CHARS)
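A quick check, under the same Python 3 setup as the question, that the encoded alternation now treats each multi-byte sequence as a unit:
import re

CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"
# Each alternative is a complete UTF-8 byte sequence, e.g. b'\xe2\x80\xa8'
ENCODED_PATTERN = "|".join(CHARS).encode('utf-8')
ENCODED_TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?".encode('utf-8')
print(re.sub(ENCODED_PATTERN, b'\n', ENCODED_TEXT))
# b'How should I replace my unicode string\nusing utf-8-encoded bytes?'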

Remove accented characters from string - Python

I get some data from a webpage and read it like this in Python:
origional_doc = urllib2.urlopen(url).read()
Sometimes this url has characters such as é and ä etc. How could I remove these characters from the string? Right now this is what I am trying:
import unicodedata
origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))
But I get an error
TypeError: must be unicode, not str
This should work. It will eliminate all characters that are not ASCII:
original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
Using re you can sub out all characters that are in a certain hexadecimal range (note \x80-\xFF lies outside ASCII):
>>> re.sub('[\x80-\xFF]','','é and ä and ect')
' and and ect'
You can also do the inverse and sub anything that's NOT in the basic 128 ASCII characters:
>>> re.sub('[^\x00-\x7F]','','é and ä and ect')
' and and ect'
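For reference, the asker's unicodedata approach also works once the missing decode step is added (assuming, as is likely for a webpage, that the bytes are UTF-8):
import unicodedata
import urllib2

# url as in the question; decode the bytes before normalizing
origional_doc = urllib2.urlopen(url).read().decode('utf-8')
stripped = ''.join(c for c in unicodedata.normalize('NFD', origional_doc)
                   if unicodedata.category(c) != 'Mn')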

How to remove special characters from strings in python?

I have millions of strings scraped from web like:
s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True
Special characters like those in the string above are inevitable when scraping from the web. How should one remove all such special characters to retain just clean text? I am thinking of a regular expression like this, based on my very limited experience with unicode characters:
\\x.*[0-9]
The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. If you print the string you will see the actual unicode (UTF-8) characters:
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"
If you want to print only the ASCII characters, you can check whether each character is in string.printable:
>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
This thing worked for me as mentioned by Padriac in comments:
s.decode('ascii', errors='ignore')
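Note that stripping loses the apostrophe entirely. If the aim is instead to recover the intended text, the string looks like UTF-8 bytes that were mis-decoded as Latin-1, and that round trip can be reversed. A sketch of that alternative (Python 3):
s = 'WHAT\xe2\x80\x99S UP DOC?'
# Re-encode the mojibake back to its original bytes, then decode properly
fixed = s.encode('latin-1').decode('utf-8')
print(fixed)  # WHAT'S UP DOC?  (with a real right single quotation mark)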

Python Polish character encoding issues

I'm having some issues with character encoding, and in this special case with Polish characters.
I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to work with Polish characters. How can I replace these characters?
The é for example is a windows-1252 character and must stay this way. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).
I tried this:
import unicodedata
text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))
This prints:
Racawicka Roge
But now the ó and é have both been stripped to o and e.
How can I get this right?
If you want to move to 1252, that's what you should tell encode and decode:
>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
If you are not handling big texts, just like your example, you can make use of the Unidecode library together with the solution provided by jonrsharpe.
from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''
for i in text:
    try:
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        result += unidecode(i)
print result  # which will be 'Raclawicka Rógé'
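For bigger inputs, the same unidecode fallback can be hooked into the codec machinery with codecs.register_error, which avoids the per-character Python loop. A sketch (the handler name 'unidecode_fallback' is arbitrary):
import codecs
from unidecode import unidecode

def unidecode_fallback(e):
    # Transliterate the span that cp1252 could not encode
    part = e.object[e.start:e.end]
    return unidecode(part), e.end

codecs.register_error('unidecode_fallback', unidecode_fallback)

text = u'Racławicka Rógé'
print(text.encode('cp1252', errors='unidecode_fallback').decode('cp1252'))
# Raclawicka Rógé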

How to account for accent characters for regex in Python?

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
It searches str1 and finds all the hashtags. This works; however, it doesn't account for accented characters, such as áéíóúñü¿.
If one of these letters is in str1, the hashtag will be saved only up to the letter before it. So, for example, #yogenfrüz would become #yogenfr.
I need to be able to account for all accented letters that range from German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.
How can I go about doing this?
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
Regex101 Demo
EDIT
Check the useful comment below from Martijn Pieters.
I know this question is a little outdated but you may also consider adding the range of accented characters from À (index 192) to ÿ (index 255) to your original regex:
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz']
Hope this'll help anyone else.
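One caveat with the literal À-ÿ range: it also includes the two non-letters × (U+00D7) and ÷ (U+00F7). If those should not count as hashtag characters, the range can be split around them (Python 3, illustrative data):
import re

str1 = '#yogenfrüz #2×2'
print(re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1))        # ['yogenfrüz', '2×2'] -- × slips in
print(re.findall(r'#([A-Za-z0-9_À-ÖØ-öø-ÿ]+)', str1))  # ['yogenfrüz', '2']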
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
How do I convert all those escape characters into their respective characters, e.g. if there is a unicode à, how do I convert that into a standard a?
Assume you have loaded your unicode into a variable called my_unicode... normalizing à into a is this simple...
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example...
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
Here's an update to Ibrahim Najjar's original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:
import re
import unicodedata
s = "#ábá123"
n = unicodedata.normalize('NFC', s)
print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))
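The NFC normalization is what keeps this working when the input arrives decomposed: in NFD, ü is u plus a combining diaeresis (U+0308), and Python's \w does not match combining marks, so the match would stop mid-word. An illustrative check:
import re
import unicodedata

nfd = '#yogenfru\u0308z'                 # decomposed form
nfc = unicodedata.normalize('NFC', nfd)  # ü as a single code point
print(re.findall(r'#(\w+)', nfd))        # ['yogenfru'] -- cut at the combining mark
print(re.findall(r'#(\w+)', nfc))        # ['yogenfrüz']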
