This question already has answers here:
Why do I get the u"xyz" format when I print a list of unicode strings in Python?
(3 answers)
Closed 9 years ago.
Python doesn't seem to be working with Arabic letters here in the code below. Any ideas?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import nltk
sentence = "ورود ممنوع"
tokens = nltk.word_tokenize(sentence)
print tokens
The result is:
>>>
['\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf', '\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']
>>>
I also tried adding a u before the string, but it didn't help:
>>> u"ورود ممنوع"
>>>
['\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf', '\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']
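Your result is correct: the list contains UTF-8 encoded byte strings, and printing a list shows the repr of each element, which escapes non-ASCII bytes. Printing the elements one by one displays the text: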
>>> lst = ['\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf',
'\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']
>>> for l in lst:
... print l
...
ورود
ممنوع
To convert the elements to unicode, you can use a list comprehension:
>>> lst = [e.decode('utf-8') for e in lst]
>>> lst
[u'\u0648\u0631\u0648\u062f', u'\u0645\u0645\u0646\u0648\u0639']
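For readers on Python 3, where str is already Unicode and byte strings are a separate bytes type, the same idea can be sketched as follows. The variable names are illustrative, and the byte values are the ones from the output above.

```python
# Python 3 sketch: the tokens from the question as UTF-8 encoded bytes.
raw_tokens = [
    b'\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf',
    b'\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9',
]

# Decode each byte string into text.
tokens = [t.decode('utf-8') for t in raw_tokens]

# Printing the list shows the repr of each element; Python 3 keeps
# printable non-ASCII characters readable in that repr.
print(tokens)

# Printing the elements one by one shows the plain text.
for token in tokens:
    print(token)
```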
This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 2 years ago.
Normally I use unicodedata to normalize other Latin-ish text. However, I've come across this and am not sure what to do:
>>> s = 'Nguyễn Văn Trỗi'
>>> unicodedata.normalize('NFD', s)
'Nguyễn Văn Trỗi'
Is there another module that can normalize more accents than unicodedata? The output I want is:
Nguyen Van Troi
normalize doesn't mean "remove accents"; it converts between composed and decomposed forms:
>>> import unicodedata as ud
>>> a = 'ă'
>>> print(ascii(ud.normalize('NFD',a))) # LATIN SMALL LETTER A + COMBINING BREVE
'a\u0306'
>>> print(ascii(ud.normalize('NFC',a))) # LATIN SMALL LETTER A WITH BREVE
'\u0103'
One way to remove them is to encode the decomposed form as ASCII while ignoring errors, which works because combining characters are not ASCII. Note, however, that not all international characters have decomposed forms; đ is one example.
>>> s = 'Nguyễn Văn Trỗi'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Nguyen Van Troi'
>>> s = 'Ngô Đình Diệm'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Ngo inh Diem' # wrong: the Đ was dropped
You can work around the exceptions with a translation table:
>>> table = {ord('Đ'):'D',ord('đ'):'d'}
>>> ud.normalize('NFD',s).translate(table).encode('ascii',errors='ignore').decode('ascii')
'Ngo Dinh Diem'
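The steps above can be wrapped in a small helper. This is only a sketch: strip_accents and EXTRA are hypothetical names, and instead of the ASCII encode/decode round trip it filters with unicodedata.combining, so characters that are not combining marks are kept rather than silently deleted. The table lists only the two letters mentioned above.

```python
import unicodedata

# Characters without a canonical decomposition need an explicit mapping.
# Only the two letters from the answer above are listed; extend as needed.
EXTRA = str.maketrans({'Đ': 'D', 'đ': 'd'})

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    # Drop combining marks and keep every other character.
    without_marks = ''.join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.translate(EXTRA)

print(strip_accents('Nguyễn Văn Trỗi'))
print(strip_accents('Ngô Đình Diệm'))
```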
This question already has answers here:
Split a string to even sized chunks
(9 answers)
Closed 8 years ago.
How can I separate a string such as "Blahblahblahblah" into "Blah" "blah" "blah" "blah" in Python? I've tried the following:
str = "Blahblahblahblah"
for letter[0:3] on str
How can I do it?
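If you do not mind using the re library, you can use re.findall. In this example the regex .{4} matches any four consecutive characters other than \n. Note that a trailing remainder shorter than four characters is not returned.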
import re
str = "Blahblahblahblah"
print re.findall(".{4}", str)
Output:
['Blah', 'blah', 'blah', 'blah']
Note: str is not a good name for a variable, because it shadows the built-in str type, which converts a given value into a string.
Try:
>>> SUBSTR_LEN = 4
>>> string = "bla1bla2bla3bla4"
>>> [string[n:n + SUBSTR_LEN] for n in range(0, len(string), SUBSTR_LEN)]
['bla1', 'bla2', 'bla3', 'bla4']
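The two approaches differ when the length of the string is not a multiple of the chunk size. A minimal sketch comparing them follows; chunks is a hypothetical helper name and "abcdefghij" is an illustrative input.

```python
import re

def chunks(text, size):
    """Split text into pieces of at most `size` characters (hypothetical helper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Length is a multiple of 4: every chunk is full.
print(chunks("Blahblahblahblah", 4))

# Length is not a multiple of 4: slicing keeps the short last chunk.
print(chunks("abcdefghij", 4))

# The regex .{4} drops the short remainder, while .{1,4} keeps it.
print(re.findall(".{4}", "abcdefghij"))
print(re.findall(".{1,4}", "abcdefghij"))
```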
This question already has answers here:
How to convert string to Title Case in Python?
(10 answers)
Closed 9 years ago.
I'm having trouble trying to create a function that can do this job. The objective is to convert strings like
one to One
hello_world to HelloWorld
foo_bar_baz to FooBarBaz
I know that the proper way to do this is using re.sub, but I'm having trouble creating the right regular expressions to do the job.
You can try something like this:
>>> s = 'one'
>>> filter(str.isalnum, s.title())
'One'
>>>
>>> s = 'hello_world'
>>> filter(str.isalnum, s.title())
'HelloWorld'
>>>
>>> s = 'foo_bar_baz'
>>> filter(str.isalnum, s.title())
'FooBarBaz'
Relevant documentation:
str.title()
str.isalnum()
filter()
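The transcript above relies on Python 2, where filter() applied to a string returns a string. In Python 3 it returns an iterator, so the result has to be joined. The sketch below shows that variant next to a re.sub version, since the question asks about re.sub; both function names are hypothetical, and the regex assumes lowercase ASCII words separated by single underscores.

```python
import re

def to_camel_filter(s):
    """Python 3 form of the filter approach: title-case, then keep alphanumerics."""
    return ''.join(filter(str.isalnum, s.title()))

def to_camel_re(s):
    """Uppercase the letter at the start of the string or after an underscore,
    dropping the underscore itself."""
    return re.sub(r'(?:^|_)([a-z])', lambda m: m.group(1).upper(), s)

for name in ('one', 'hello_world', 'foo_bar_baz'):
    print(name, to_camel_filter(name), to_camel_re(name))
```

The two are not equivalent on every input: str.title() also lowercases the remaining letters of each word, while the re.sub version leaves them untouched.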
Found solution:
def uppercase(name):
    return ''.join(x for x in name.title() if not x.isspace()).replace('_', '')
This question already has answers here:
Python Trailing L Problem
(5 answers)
Closed 9 years ago.
I receive from a module a string that is the representation of a long int:
>>> str = hex(5L)
>>> str
'0x5L'
What I now want is to convert the string str back to a number (integer).
int(str,16) does not work because of the L.
Is there a way to do this without stripping the last L out of the string? It is also possible that the string contains a hex value without the L.
Use str.rstrip; it works for both cases:
>>> int('0x5L'.rstrip('L'),16)
5
>>> int('0x5'.rstrip('L'),16)
5
Or generate the string this way:
>>> s = '{:#x}'.format(5L) # remove '#' if you don't want '0x'
>>> s
'0x5'
>>> int(s, 16)
5
You could even just use:
>>> str = hex(5L)
>>> long(str,16)
5L
>>> int(long(str,16))
5
>>>
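The transcripts above are Python 2. Python 3 has no separate long type and hex() never appends an L, but strings produced by Python 2 code may still carry one. A minimal Python 3 sketch of the rstrip approach follows; parse_hex is a hypothetical helper name.

```python
def parse_hex(text):
    """Parse a hex string that may end in a Python 2 style 'L' suffix."""
    # 'L' is not a hex digit, so stripping it from the right cannot
    # remove part of the number itself.
    return int(text.rstrip('L'), 16)

print(parse_hex('0x5L'))
print(parse_hex('0x5'))
print(parse_hex('0xFFL'))
```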
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Python split string on regex
How do I split some text using Python's re module into two parts: the text before a special word cut and the rest of the text following it?
You can do it with re:
>>> import re
>>> re.split('cut', s, 1) # Split only once.
But in this case you can just use str.split:
>>> s.split('cut', 1) # Split only once.
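As a self-contained sketch, here are both calls applied to an illustrative string (the value of s is an assumption, since the question does not give one), together with str.partition, which also splits at the first occurrence only:

```python
import re

s = "hellocutworldcutpython"  # illustrative input, not from the question

# Split once with the re module.
parts_re = re.split(r"cut", s, maxsplit=1)

# Split once with a plain string method.
parts_str = s.split("cut", 1)

# partition returns the text before, the separator, and the text after.
before, sep, after = s.partition("cut")

print(parts_re)
print(parts_str)
print(before, after)
```

If the separator is missing, partition returns the whole string as the first element and two empty strings, so the unpacking never fails.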
Check this; it might help you:
>>> re.compile('[0-9]+').split("hel2l3o")
['hel', 'l', 'o']
>>>
>>> re.compile('cut').split("hellocutworldcutpython")
['hello', 'world', 'python']
To split at the first cut only, pass a maxsplit of 1 so that the rest of the text is kept intact:
>>> l = re.compile('cut').split("hellocutworldcutpython", 1)
>>> print l[0], l[1]
hello worldcutpython