How do I iterate through unicode symbols, not bytes in python? - python

Given the accented unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.
What I do now is using a dictionary of acuted and not acuted symbols:
accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е', u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))
I want to do something like this:
def changeAccentFormat(word):
for letter in accent_dict:
if letter in word:
its_index = word.index(letter)
word = word[:its_index + 1] + u'+' + word[its_index + 1:]
return word
But of course it does not work as desired. I noticed that this code:
>>> word = u'кни́га'
>>> for letter in word:
... print letter
gives
к
н
и
´
г
а
(Well, i didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?

First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word is an unicode object, not an encoded bytestring.
Now, for combination characters in Unicode:
For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:
See U+00E7, U+0063 and U+0327
So in Python, you could either write either form, it will get composed at display time to the same character:
>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç
In order to convert between composed and decomposed forms, you can use unicodedata.normalize():
>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç
(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).
They still are different forms though - one consisting of one code point, the other of two:
>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2
However, in your case, there simply does not seem to be a combined character for the lowercase и with an accent acute (there is one for и with an accent grave)

You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.
Here is the word you have by each user perceived character:
>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5

Acutes are represented by codepoint 301, COMBINING ACUTE ACCENT, so a simple string character replacement should suffice:
>>>print u'кни́га'.replace(u'\u0301', "+")
кни+га
If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick

Related

Seperate accents from their letters

I'm looking for a function that will take a compound letter and split it as if you had to type it on a US-INTL keyboard, like so:
'ȯ' becomes ".o"
'â' becomes "^a"
'ë' becomes "\"e"
'è' becomes "`e"
'é' becomes "'e"
'ñ' becomes "~n"
'ç' becomes ",c"
etc.
But when searching for this issue I can only find functions to remove accents entirely, which is not what I want.
Here's what I want to accomplish:
Expand this string:
ër íí àha lá eïsch
into this string:
"er 'i'i `aha l'a e"isch
You can possibly use a dictionary to match the characters with their replacements and then iterate over the string to do the actual replacement.
word_rep = dict(zip(['ȯ','â','ë','è','é','ñ','ç']
['.o','^a','\"e','`e','\'e','~n',',c']))
mystr = 'ër íí àha lá eïsch'
for key,value in word_rep.items():
mystr = mystr.replace(key,value)
Below uses Unicode decomposition to separate combining marks from latin letters, a regular expression to swap the combining character and its letter, then a translation table to convert the combining mark to the key used on the international keyboard:
import unicodedata as ud
import re
replacements = {'\N{COMBINING DOT ABOVE}':'.',
'\N{COMBINING CIRCUMFLEX ACCENT}':'^',
'\N{COMBINING DIAERESIS}':'"',
'\N{COMBINING GRAVE ACCENT}':'`',
'\N{COMBINING ACUTE ACCENT}':"'",
'\N{COMBINING TILDE}':'~',
'\N{COMBINING CEDILLA}':','}
combining = ''.join(replacements.keys())
typing = ''.join(replacements.values())
translation = str.maketrans(combining,typing)
s = 'ër íí àha lá eïsch'
s = ud.normalize('NFD',s)
s = re.sub(rf'([aeiounc])([{combining}])',r'\2\1',s)
s = s.translate(translation)
print(s)
Output:
"er 'i'i `aha l'a e"isch

python 2.7 split unicode text using unicode comma

I have a unicode string (种类:猫; means type:cat)which contains a unicode colon ': ' in Chinese. I would like to separate the string into 2 parts using:
a.split(u': ')
The length of the result list is always 1, so not spliced.
Can someone give me a hint on how to do this type of operation? Thanks!
That's a FULLWIDTH COLON, not an ASCII colon:
>>> s = u'种类:猫'
>>> import unicodedata as ud
>>> for c in s:
... print hex(ord(c)),ud.name(c)
...
0x79cd CJK UNIFIED IDEOGRAPH-79CD
0x7c7b CJK UNIFIED IDEOGRAPH-7C7B
0xff1a FULLWIDTH COLON
0x732b CJK UNIFIED IDEOGRAPH-732B
So you can split it a number of ways:
>>> s.split(u'\uff1a') # by Unicode codepoint
[u'\u79cd\u7c7b', u'\u732b']
>>> s.split(u'\N{FULLWIDTH COLON}') # by name
[u'\u79cd\u7c7b', u'\u732b']
>>> s.split(u':') # Using the correct (single) character
[u'\u79cd\u7c7b', u'\u732b']
Since you are using Python 2.7, to see the output correctly you'll need to print the list items:
>>> for item in s.split(u'\uff1a'):
... print item
...
种类
猫

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is for any character in (in)(like) it is splitting the string s at that character and giving me first slice.(as after "cou" it finds one matching char i:e n). It's happening for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char as 'e' after p.
So i want some advice how to treat (in)(like) as words not as characters , when splitting occurs/matters.
please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
You code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
if fieldObj:
print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

How can I remove non-ASCII characters but leave periods and spaces?

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
if ord(char) < 48 or ord(char) > 127: return ''
else: return char
def get_my_string(file_path):
f=open(file_path,'r')
data=f.read()
f.close()
filtered_data=filter(onlyascii, data)
filtered_data = filtered_data.lower()
return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>>s = u'Good bye in Swedish is Hej d\xe5'
>>>s = s.encode('ascii',errors='ignore')
>>>print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>>"Hej då".encode("ascii", errors="ignore").decode()
'hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
According to #artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string)
See more examples here Replace non-ASCII characters with a single space
You may use the following code to remove non-English letters:
import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)
This will return
123456790 ABC#%? .()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A greater solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
If you want printable ascii characters you probably should correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
this is equivalent, to string.printable (answer from #jterrace), except for the absence of returns and tabs ('\t','\n','\x0b','\x0c' and '\r') but doesnt correspond to the range on your question
this is best way to get ascii characters and clean code, Checks for all possible errors
from string import printable
def getOnlyCharacters(texts):
_type = None
result = ''
if type(texts).__name__ == 'bytes':
_type = 'bytes'
texts = texts.decode('utf-8','ignore')
else:
_type = 'str'
texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
texts = str(texts)
for text in texts:
if text in printable:
result += text
if _type == 'bytes':
result = result.encode('utf-8')
return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

Removing control characters from a string in python

I currently have the following code
def removeControlCharacters(line):
i = 0
for c in line:
if (c < chr(32)):
line = line[:i - 1] + line[i+1:]
i += 1
return line
This is just does not work if there are more than one character to be deleted.
There are hundreds of control characters in unicode. If you are sanitizing data from the web or some other source that might contain non-ascii characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata
def remove_control_characters(s):
return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
You could use str.translate with the appropriate map, for example like this:
>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].
You may test it like this:
>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True
So to remove the control characters using re just use the following:
>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.
pip install regex
import regex as rx
def remove_control_characters(str):
return rx.sub(r'\p{C}', '', 'my-string')
\p{C} is the unicode character property for control characters, so you can leave it up to the unicode consortium which ones of the millions of unicode characters available should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.
Your implementation is wrong because the value of i is incorrect. However that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n2) instead of O(n). Try this instead:
return ''.join(c for c in line if ord(c) >= 32)
And for Python 2, with the builtin translate:
import string
all_bytes = string.maketrans('', '') # String of 256 characters with (byte) value 0 to 255
line.translate(all_bytes, all_bytes[:32]) # All bytes < 32 are deleted (the second argument lists the bytes to delete)
You modify the line during iterating over it. Something like ''.join([x for x in line if ord(x) >= 32])
filter(string.printable[:-5].__contains__,line)
I've tried all the above and it didn't help. In my case, I had to remove Unicode 'LRM' chars:
Finally I found this solution that did the job:
df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')
Reference here.

Categories