I've got an issue iterating through Unicode strings character by character in Python.
print "w: ",word
for c in word:
print "word: ",c
This is my output:
w: 文本
word: ?
word: ?
word: ?
word: ?
word: ?
word: ?
My desired output is:
文
本
When I use len(word) I get 6; apparently each character is stored as 3 bytes, because word is a UTF-8 encoded byte string and each of these characters encodes to 3 bytes in UTF-8.
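A minimal sketch of what is going on (Python 2 assumed, since the question uses print statements; the string literal is a UTF-8 byte string):

# -*- coding: utf-8 -*-
word = "文本"                      # a UTF-8 byte string (str in Python 2)
print len(word)                    # 6 -- three bytes per character
print len(word.decode('utf-8'))    # 2 -- two actual characters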
So my Unicode text is successfully stored in the variable, but I cannot get the characters out. I have tried encode('utf-8'), decode('utf-8') and the codecs module, but still cannot get any good results. This seems like a simple problem but is frustratingly hard for me.
Hope someone can point me in the right direction.
Thanks!
# -*- coding: utf-8 -*-
word = "文本"
print(word)
for each in unicode(word,"utf-8"):
print(each)
Output:
文本
文
本
The code I used, which works, is this:
import codecs

fileContent = codecs.open('fileName.txt', 'r', encoding='utf-8')
# ...split by whitespace to get words...
for c in word:
    print(c.encode('utf-8'))
You should convert the word from a byte string (str) to unicode:
print "w: ",word
for c in word.decode('utf-8'):
    print "word: ", c
For Python 3, this is what works:
import unicodedata
word = "文本"
word = unicodedata.normalize('NFC', word)
for char in word:
    print(char)
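The normalize('NFC') call matters when the input may arrive in decomposed form. A minimal sketch (the decomposed string below is a constructed example): an accented letter stored as a base letter plus a combining mark iterates as two code points until it is composed:

import unicodedata

decomposed = "e\u0301"                              # 'e' + COMBINING ACUTE ACCENT
print(len(decomposed))                               # 2
composed = unicodedata.normalize('NFC', decomposed)
print(len(composed))                                 # 1 -- a single 'é'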
Related
I want to find out if a substring is contained in the string and remove it without touching the rest of the string. The catch is that the substring pattern I have to search for is not exactly what the string will contain. In particular, the problem is due to Spanish accented vowels and, at the same time, the substring being uppercase, so for example:
myString = "I'm júst a tésting stríng"
substring = 'TESTING'
Perform something to obtain:
resultingString = "I'm júst a stríng"
Right now I've read that the difflib library can compare two strings and weigh their similarity somehow, but I'm not sure how to apply it to my case (not to mention that I failed to install the lib).
Thanks!
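(Side note: difflib is part of the standard library, so nothing needs installing. A minimal sketch of the similarity idea mentioned above, using difflib.SequenceMatcher; the case-folding before comparing is an assumption about how you would want to match:)

import difflib

# Ratio in [0, 1]; 1.0 means identical sequences.
score = difflib.SequenceMatcher(None, 'tésting'.lower(), 'TESTING'.lower()).ratio()
print(score)   # ~0.86 -- 'é' vs 'e' is the only difference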
This normalize() method might be a little overkill, and maybe using the code from @Harpe at https://stackoverflow.com/a/71591988/218663 works fine.
Here I am going to break the original string into "words" and then join all the non-matching words back into a string:
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst a tésting stríng"
substring = "TESTING"
newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))
print(newString)
giving you:
I'm júst a stríng
If your "substring" could be multi-word I might think about switching strategies to a regex:
import re
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst á tésting stríng"
substring = "A TESTING"
match = re.search(f"\\s{ normalize(substring) }\\s", normalize(myString))
if match:
    found_at = match.span()
    first_part = myString[:found_at[0]]
    second_part = myString[found_at[1]:]
    print(f"{first_part} {second_part}".strip())
I think that will give you:
I'm júst stríng
You can use the unicodedata module to normalize accented letters to ASCII letters like so:
import unicodedata
output = unicodedata.normalize('NFD', "I'm júst a tésting stríng").encode('ascii', 'ignore')
print(str(output))
which will give
b"I'm just a testing string"
You can then compare this with your input
"TESTING".lower() in str(output).lower()
which should return True.
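(A slightly cleaner variant, my sketch rather than part of the original answer: decoding the bytes back to str avoids comparing against the b"..." repr that str(output) produces:)

import unicodedata

decomposed = unicodedata.normalize('NFD', "I'm júst a tésting stríng")
normalized = decomposed.encode('ascii', 'ignore').decode('ascii')
print("TESTING".lower() in normalized.lower())   # True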
(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I see here and here methods that don't work. I'm working in a 100% UTF-8 environment.
import codecs

# pure ASCII string : ok
mystring = "a\n"  # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print(cod(mystring))
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug? Have I misunderstood something?
Any help would be appreciated!
PS: I edited my post thanks to Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = "€\\n"?
mystring = "€\n"   # that's 2 chars: "€" and a newline
mystring = "€\\n"  # that's 3 chars: "€", "\" and "n"
I don't really understand what goes wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.
What we did is bypass the "utf_8" encoder after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (in UTF-8 encoding), in this case b'\xe2\x82\xac\n'.
We now do not print the str object directly, nor do we use encode("utf_8"); instead we use ord() to build the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from this bytes object by passing it to str() with an encoding.
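(Side note, my observation rather than part of the original answer: the bytes([ord(c) for c in ...]) trick works because code points U+0000 through U+00FF map one-to-one onto Latin-1 bytes, so encoding with "latin-1" is an equivalent, more idiomatic spelling:)

raw = "€\\n".encode("utf_8").decode("unicode_escape").encode("latin-1")
print(raw)                  # b'\xe2\x82\xac\n'
print(raw.decode("utf_8"))  # '€\n'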
BTW, the tool my friend and I wanted to make is a wrapper that allows the user to input C-like string literals and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21   # 20 characters, which represent 6 chars semantically
Output:                            # \n (an empty line)
ab                                 # \x61\x62\n
 !                                 # \x20\x21
That's a powerful tool for users to input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh:  # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string

printable = string.printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
The execution of a simple script is not going as expected.
notAllowed = {"â":"a", "à":"a", "é":"e", "è":"e", "ê":"e",
"î":"i", "ô":"o", "ç":"c", "û":"u"}
word = "dôzerté"
print word
for char in word:
if char in notAllowed.keys():
print "hooray"
word = word.replace(char, notAllowed[char])
print word
print "finished"
The output returns the word unchanged, even though it should have changed "ô" and "é" to o and e, thus returning dozerte...
Any ideas?
How about:
# -*- coding: utf-8 -*-
notAllowed = {u"â":u"a", u"à":u"a", u"é":u"e", u"è":u"e", u"ê":u"e",
u"î":u"i", u"ô":u"o", u"ç":u"c", u"û":u"u"}
word = u"dôzerté"
print word
for char in word:
if char in notAllowed.keys():
print "hooray"
word = word.replace(char, notAllowed[char])
print word
print "finished"
Basically, if you want to assign a Unicode string to a variable, you need to use:
u"..."
#instead of just
"..."
to denote the fact that this is the unicode string.
Iterating over a byte string iterates its bytes, not necessarily its characters. If the encoding of your Python source file is UTF-8, len(word) will be 9 instead of 7 (both special characters have a two-byte UTF-8 encoding). Iterating over a Unicode string (u"dôzerté") iterates characters, so that works.
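A minimal sketch of the difference (Python 2, UTF-8 source encoding assumed):

# -*- coding: utf-8 -*-
word = "dôzerté"                   # byte string: 9 bytes
print len(word)                    # 9
print len(word.decode('utf-8'))    # 7 -- the actual number of characters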
May I also suggest you use unidecode for the task you're trying to achieve?
I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127:
        return ''
    else:
        return char

def get_my_string(file_path):
    f = open(file_path, 'r')
    data = f.read()
    f.close()
    filtered_data = filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
An easy way to change to a different codec is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii', errors='ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python 3: str -> bytes -> str
>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'
Python 2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python 2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)
See more examples here: Replace non-ASCII characters with a single space
You may use the following code to remove non-English letters:
import re

s = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', s)
print(result)
This will return
123456790 ABC#%? .()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others, e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A greater solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126

# and later:
filtered_data = filter(filter_func, data).lower()
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
                     'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
If you want printable ASCII characters, you probably should correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
This is equivalent to string.printable (answer from @jterrace), except for the absence of returns and tabs ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.
This is the best way to get ASCII characters with clean code; it checks for all possible errors:
from string import printable

def getOnlyCharacters(texts):
    _type = None
    result = ''
    if type(texts).__name__ == 'bytes':
        _type = 'bytes'
        texts = texts.decode('utf-8', 'ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
        texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
    if _type == 'bytes':
        result = result.encode('utf-8')
    return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri
Does something exist that can take as input U+0043 and produce as output the letter C, maybe even a small description of the character (like LATIN CAPITAL LETTER C)?
EDIT: the U+0043 is just an example. I would like a generic solution please, that could work for as many codepoints as possible.
unicodedata.name looks promising. You need a bit of (trivial) parsing, of course, if you have a string input like U+0043.
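A minimal sketch of that parsing (Python 3 assumed; stripping the 'U+' prefix is the only extra work):

import unicodedata

codepoint = "U+0043"
char = chr(int(codepoint[2:], 16))    # drop 'U+' and parse the hex digits
print(char)                            # C
print(unicodedata.name(char))          # LATIN CAPITAL LETTER C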
The hackish way:
import unicodedata
codepoint = b"U+0043"
char = codepoint.replace('U+', "\u").decode('unicode-escape')
# or char = unichr(int(codepoint.replace('U+', ''), 16))
print char
print unicodedata.name(char)
import unicodedata
print unicodedata.name(u'C') # or unicodedata.name(u'\u0043')
# LATIN CAPITAL LETTER C
You could do chr(0x43) to get C.