Stripping a Unicode text of whatever is not a letter - Python

I'm trying to write a simple Python script that takes a text file as input, deletes every non-letter character, and writes the output to another file.
Normally I would have done it in one of two ways:
use a regular expression combined with re.sub to replace every non letter character with empty strings
examine every char in every line and write it to the output only if it was in string.lowercase
But this time the text is The Divine Comedy in Italian (I'm Italian), so there are some Unicode characters like
èéï
and some others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but all that seems to do is stop Python from signalling errors when Unicode characters are written inside the script.
Then I tried to include Unicode chars in my regular expression, writing them as, for example:
u'\u00AB'
and it seems to work, but when Python reads input from a file it doesn't write out what it read the same way it read it; for example, some characters get converted into a square-root symbol.
What should I do?

unicodedata.category(unichr) will return the category of that code-point.
You can find a description of the categories at unicode.org, but the ones relevant to you are the L, N, P, Z and maybe S groups:
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
...
You might also want to normalize your string first so that diacriticals that can attach to letters do so:
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
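For instance, a quick check of a couple of code points and of NFC composition (Python 3 syntax):
import unicodedata

print(unicodedata.category(u'è'))       # 'Ll' -- lowercase letter
print(unicodedata.category(u'\u00AB'))  # 'Pi' -- initial-quote punctuation («)
decomposed = u'e\u0300'                 # 'e' + COMBINING GRAVE ACCENT
print(unicodedata.normalize('NFC', decomposed) == u'\u00E8')  # True: è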
Putting all this together:
import unicodedata

file_bytes = ...  # however you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Digits
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs',                          # Breaking spaces
])
filtered_text = ''.join(
    ch for ch in normalized_text
    if unicodedata.category(ch) in allowed_categories)
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
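As a quick sanity check on a made-up sample: the euro sign (category 'Sc') and the tab ('Cc') are not in the allowed set, so both disappear, while the accented letters ('Ll') and spaces ('Zs') survive:
sample = u'Nel mezzo\tdel cammin di nostra vita\u20ac'
print(''.join(ch for ch in sample
              if unicodedata.category(ch) in allowed_categories))
# Nel mezzodel cammin di nostra vita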

import codecs

f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print repr(line)
    print line
1. will give you the Unicode representation of the line.
2. will give you the line as it is written in your file.
Hopefully it will help you :)
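If you are on Python 3, the built-in open() takes an encoding argument directly, so codecs is unnecessary; a minimal equivalent sketch:
with open('FILENAME', encoding='utf-8') as f:  # Python 3 decodes for you
    for line in f:
        print(repr(line))  # the escaped representation
        print(line)        # the line as written in the file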

Related

Separate accents from their letters

I'm looking for a function that will take a compound letter and split it as if you had to type it on a US-INTL keyboard, like so:
'ȯ' becomes ".o"
'â' becomes "^a"
'ë' becomes "\"e"
'è' becomes "`e"
'é' becomes "'e"
'ñ' becomes "~n"
'ç' becomes ",c"
etc.
But when searching for this issue I can only find functions to remove accents entirely, which is not what I want.
Here's what I want to accomplish:
Expand this string:
ër íí àha lá eïsch
into this string:
"er 'i'i `aha l'a e"isch
You can possibly use a dictionary to match the characters with their replacements and then iterate over it to do the actual replacement:
word_rep = dict(zip(['ȯ', 'â', 'ë', 'è', 'é', 'ñ', 'ç'],
                    ['.o', '^a', '"e', '`e', "'e", '~n', ',c']))
mystr = 'ër íí àha lá eïsch'
for key, value in word_rep.items():
    mystr = mystr.replace(key, value)
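One caveat: the dictionary keys are precomposed (NFC) characters, so this won't match text that arrives in decomposed (NFD) form. A small safeguard, as a sketch, is to normalize before replacing:
import unicodedata

# Make sure accented characters are precomposed so the single-character
# dictionary keys actually match, then run the replacement loop above.
mystr = unicodedata.normalize('NFC', mystr)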
The following uses Unicode decomposition to separate combining marks from Latin letters, a regular expression to swap each combining character with its letter, and then a translation table to convert each combining mark to the key used on the international keyboard:
import re
import unicodedata as ud

replacements = {'\N{COMBINING DOT ABOVE}': '.',
                '\N{COMBINING CIRCUMFLEX ACCENT}': '^',
                '\N{COMBINING DIAERESIS}': '"',
                '\N{COMBINING GRAVE ACCENT}': '`',
                '\N{COMBINING ACUTE ACCENT}': "'",
                '\N{COMBINING TILDE}': '~',
                '\N{COMBINING CEDILLA}': ','}
combining = ''.join(replacements.keys())
typing = ''.join(replacements.values())
translation = str.maketrans(combining, typing)

s = 'ër íí àha lá eïsch'
s = ud.normalize('NFD', s)
s = re.sub(rf'([aeiounc])([{combining}])', r'\2\1', s)
s = s.translate(translation)
print(s)
Output:
"er 'i'i `aha l'a e"isch

How to match a string to another string, ignoring special characters and spaces?

I am trying to match a value from a JSON file in my Python code against a value from another API call within the same code. The values are basically the same, but they don't match because special characters or leading/trailing spaces sometimes get in the way.
Let's say:
value in the first json file:
json1['org'] = google, LLC
value in the second json file:
json2['org'] = Google-LLC
I tried using the in operator, but it doesn't work, and I'm not sure how to bring regex into this.
So I wrote an if statement like this:
if json1['org'] in json2['org']:
    # do something
else:
    # do the last thing
It always falls through to the else branch, even though the values are effectively the same.
If the JSON values are the same apart from special characters and spaces, the comparison should match and enter the if branch.
You could remove all 'special characters/spaces' and compare the values:
import string

asciiAndNumbers = string.ascii_letters + string.digits
json1 = {'org': "google, LLC"}
json2 = {'org': "Google-LLC"}

def normalizedText(text):
    # We are only allowing a-z, A-Z and 0-9, and we lowercase the result
    return ''.join(c for c in text if c in asciiAndNumbers).lower()

j1 = normalizedText(json1['org'])
j2 = normalizedText(json2['org'])
print(j1)
print(j1 == j2)
Prints:
googlellc
True
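An equivalent normalizer using a regular expression, as a sketch:
import re

def normalizedText(text):
    # Drop everything except ASCII letters and digits, then lowercase
    return re.sub(r'[^A-Za-z0-9]', '', text).lower()

print(normalizedText("google, LLC") == normalizedText("Google-LLC"))  # True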

Text Pre-processing + Python + CSV : Removing special characters from a column of a CSV

I am working on a text classification problem. My CSV file contains a column called 'description' which describes events. Unfortunately, that column is full of special characters in addition to English words. Sometimes an entire field is made up of such characters; sometimes only a few words are, and the rest are English. I am showing you two specimen fields from two different rows:
हर वर्ष की तरह इस वर्ष भी सिंधु सेना द्वारा आयोजित सिंधी प्रीमियर लीग फुटबॉल टूर्नामेंट का आयोजन एमबीएम ग्राउंड में करने जा रही है जिसमें अंडर-19 टीमें भाग लेती है आप सभी से निवेदन है समाज के युवाओं को प्रोत्साहन करने अवश्य पधारें
Unwind on the strums of Guitar & immerse your soul into the magical vibes of music! ️? ️?..Guitar Night By Ashmik Patil.July 19, 2018.Thursday.9 PM Onwards.*Cover charges applicable...#GuitarNight #MusicalNight #MagicalMusic #MusicLove #Party #Enjoy #TheBarTerminal #Mumbaikars #Mumbai
In the first one, the entire field consists of such unreadable characters, whereas in the second, only a few are present and the rest are English words.
I want to remove only those special characters, keeping the English words as they are, since I need them to form a bag of words at a later stage.
How can I implement that in Python (I am using a Jupyter notebook)?
You can do this with a regex. Assuming you have already extracted the text from the CSV file:
# Python 2.7
import re

text = "Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖"
cleaned_text = re.sub(r'[^\x00-\x7f]+', '', text)
print cleaned_text
Output - Something with special characters
The pattern [^\x00-\x7f]+ matches any run of characters outside the ASCII range (0x00-0x7F), which is what gets removed.
You can encode your string to ASCII and ignore the errors.
>>> text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
>>> text = text.encode('ascii', 'ignore')
This gives you a bytes object, which you can then decode back into a string:
>>> text
b'Something with special characters '
>>> text = text.decode('utf')
>>> text
'Something with special characters '
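The round trip is commonly written as a one-liner (Python 3):
text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
cleaned = text.encode('ascii', 'ignore').decode('ascii')
print(cleaned)  # 'Something with special characters '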
You could use pandas to read the CSV file into a dataframe:
import pandas as pd
df = pd.read_csv(fileName, converters={COLUMN_NUMBER: func})
where func is a function that takes a single string and removes special characters.
This can be done in different ways (e.g. using a regex), but here is a simple one:
import string

def func(strg):
    return ''.join(c for c in strg if c in string.printable[:-5])
Alternatively, you can read the dataframe first and then use apply to alter the description column, i.e.:
import pandas as pd
df = pd.read_csv(fileName)
df['description'] = df['description'].apply(func)
or using a regex (in recent pandas versions, pass regex=True explicitly):
df['description'] = df['description'].str.replace('[^A-Za-z _]', '', regex=True)
string.printable[:-5] is the set of characters '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~ '
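You can check the slice for yourself:
import string

# string.printable ends with the whitespace characters ' \t\n\r\x0b\x0c';
# slicing off the last five keeps the space but drops the rest.
print(repr(string.printable[:-5]))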

UTF-8 decoding with ascii code in it with Python

From the question and answer in UTF-8 coding in Python, I can use the binascii package to decode a UTF-8 string with '_' in it:
import binascii

def toUtf(r):
    try:
        rhexonly = r.replace('_', '')
        rbytes = binascii.unhexlify(rhexonly)
        rtext = rbytes.decode('utf-8')
    except TypeError:
        rtext = r
    return rtext
This code works fine when the string contains only hex-encoded UTF-8 characters:
r = '_ed_8e_b8'
print toUtf(r)
>> 편
However, this code does not work when the string has normal ascii code in it. The ascii can be anywhere in the string.
r = '_2f119_ed_8e_b8'
print toUtf(r)
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'
Maybe I can use a regular expression to extract the UTF-8 part and the ASCII part and reassemble them after the conversion, but I wonder if there is an easier way to do the conversion. Any good solution?
Quite straightforward with re.sub:
import re

bytegroup = r'(_[0-9a-z]{2})+'

def replacer(match):
    return toUtf(match.group())

rtext = re.sub(bytegroup, replacer, r, flags=re.I)
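Applied to the sample input from the question:
r = '_2f119_ed_8e_b8'
print(re.sub(bytegroup, replacer, r, flags=re.I))  # /119편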
That is some truly terrible input you've got. It's still fixable, though. First, split the string so the hex-encoded runs are isolated from the plain ASCII:
import re

r = '_2f119_ed_8e_b8'
# Split so you have even entries in the list as ASCII, odd as hex encodings
rsplit = re.split(r'((?:_[0-9a-fA-F]{2})+)', r)  # ['', '_2f', '119', '_ed_8e_b8', '']
# Process the hex-encoded UTF-8 with your existing function, leaving
# ASCII untouched
rsplit[1::2] = map(toUtf, rsplit[1::2])  # ['', '/', '119', '편', '']
rtext = ''.join(rsplit)  # '/119편'
The above is a verbose version that shows the individual steps, but as chthonicdaemon's answer points out, it can be shortened dramatically: use the same regular expression with re.sub instead of re.split, and pass a function to perform the replacement instead of a replacement pattern string:
# One-liner equivalent to the above with no intermediate lists
rtext = re.sub(r'(?:_[0-9a-f]{2})+', lambda m: toUtf(m.group()), r, flags=re.I)
You can package that up as a function itself, so you have one function that deals with purely hex encoded UTF-8, and a second general function that uses the first function as part of processing mixed non-encoded ASCII and hex encoded UTF-8 data.
Mind you, this won't necessarily work all that well if the non-encoded ASCII might contain _ normally; the regex tries to be as targeted as possible, but you've got a problem here where no matter how finely you target your heuristics, some ASCII data will be mistaken for encoded UTF-8 data.

Python - Unicode

The execution of a simple script is not going as expected.
notAllowed = {"â":"a", "à":"a", "é":"e", "è":"e", "ê":"e",
"î":"i", "ô":"o", "ç":"c", "û":"u"}
word = "dôzerté"
print word
for char in word:
if char in notAllowed.keys():
print "hooray"
word = word.replace(char, notAllowed[char])
print word
print "finished"
The output returns the word unchanged, even though it should have changed "ô" and "é" to o and e, thus returning dozerte...
Any ideas?
How about:
# -*- coding: utf-8 -*-
notAllowed = {u"â": u"a", u"à": u"a", u"é": u"e", u"è": u"e", u"ê": u"e",
              u"î": u"i", u"ô": u"o", u"ç": u"c", u"û": u"u"}
word = u"dôzerté"
print word
for char in word:
    if char in notAllowed.keys():
        print "hooray"
        word = word.replace(char, notAllowed[char])
print word
print "finished"
Basically, if you want to assign a Unicode string to some variable you need to use:
u"..."
# instead of just
"..."
to denote the fact that it is a Unicode string.
Iterating a byte string iterates its bytes, not necessarily its characters. If the encoding of your Python source file is UTF-8, len(word) will be 9 instead of 7 (both special characters have a two-byte encoding). Iterating a unicode string (u"dôzerté") iterates characters, so that should work.
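A quick way to see the difference (Python 2, assuming the source file is saved as UTF-8):
# -*- coding: utf-8 -*-
word = "dôzerté"
uword = u"dôzerté"
print len(word)   # 9 -- byte string: each accented letter is two UTF-8 bytes
print len(uword)  # 7 -- unicode string: one code point per character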
May I also suggest you use unidecode for the task you're trying to achieve?
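For completeness, a minimal unidecode sketch (it is a third-party package, installed with pip install unidecode):
# -*- coding: utf-8 -*-
from unidecode import unidecode

print unidecode(u"dôzerté")  # dozerte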
