Right justify string containing Thai characters - python

I would like to right-justify strings containing Thai characters (Thai rendering doesn't proceed strictly left to right; combining marks stack above and below as well).
For example, for the strings ไป (two characters, display width 2) and ซื้อ (four characters, display width 2), I want the following output (width 5):
...ไป
...ซื้อ
The naive approach
print 'ไป'.decode('utf-8').rjust(5)
print 'ซื้อ'.decode('utf-8').rjust(5)
however, respectively produce
...ไป
.ซื้อ
Any ideas how to get to the desired formatting?
EDIT:
Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word gives len(tc) = 4, but has length 2 / uses 2 places / uses 2 positions.

Cause
Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).
For example, in the word ซื้อ:
the first character is the initial consonant "SO SO",
then it has vowel mark SARA UUE,
then tone mark MAI THO,
and then the final pseudo-consonant O ANG
The problem is that characters #2 and #3 in the list above are zero-width ones.
In other words, they do not make the string "wider".
In yet other words, ซื้อ ("to buy") and ซอ ("fiddle") would have equal width of two character places (but string lengths of 4 and 2, correspondingly).
Solution
In order to calculate the "real" string length, one must skip zero-width characters.
Python-specific
The unicodedata module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 8.0.0.
The unicodedata.category(unichr) method returns one of the following General Category Values:
"Lo" for a normal character;
"Mn" for a zero-width non-spacing mark.
The rest is straightforward: simply filter out the latter.
Further info:
Unicode data for Thai script (scroll till the first occurrence of "THAI CHARACTER")

I think what you mean to ask is how to determine the 'true' # of characters in เรือ, ไป, ซื้อ etc. (which are 3, 2 and 2, respectively).
Unfortunately, here's how Python interprets these characters:
ไป
>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2
ซื้อ
>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4
เรือ
>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'
>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4
There's no real correlation between the # of characters displayed and the # of actual (from Python's perspective) characters that make up the string.
I can't think of an obvious way to do this. However, I've found this library which might be of help to you. (You will also need to install some prerequisites.)

It looks like the rjust() function will not work for you, and you will need to count the number of cells in the string yourself. You can then insert the number of spaces required before the string to achieve justification.
You seem to know about the Thai language. Sum the number of consonants, preceding vowels, following vowels and Thai punctuation. Don't count diacritics and above and below vowels.
Something like (forgive my pseudo-Python, tidied here into runnable form):
cells = 0
for ch in string:
    code = ord(ch)
    if code == 0x0E31 or 0x0E34 <= code <= 0x0E3A or 0x0E47 <= code <= 0x0E4E:
        pass  # above/below vowel, tone mark or other combining sign
    else:
        # consonant, preceding or following vowel, or punctuation
        cells += 1

Here's a function to compute the length of a Thai string (the number of characters arranged horizontally), based on bytebuster's answer:
import unicodedata

def get_thai_string_length(string):
    length = 0
    for c in string:
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length

print(len('บอินทัช'))
print(get_thai_string_length('บอินทัช'))
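Building on the same idea, the original right-justification problem can be solved by padding manually based on the display width instead of len(). The helper names below (display_width, rjust_thai) are my own, just for illustration:

```python
import unicodedata

def display_width(s):
    # count only characters that advance the cursor; skip combining marks (category Mn)
    return sum(1 for c in s if unicodedata.category(c) != 'Mn')

def rjust_thai(s, width, fillchar=' '):
    # pad on the left based on display width rather than len(s)
    return fillchar * max(0, width - display_width(s)) + s

print(rjust_thai('ไป', 5, '.'))
print(rjust_thai('ซื้อ', 5, '.'))
```

Both calls produce a string that is 5 cells wide, which is exactly the output requested in the question.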

Related

Finding the right rules to filter certain strings in python list

I have a large plain text file and I have to 'clean' it in python3.
For now I've read it into a list with the following strings:
['[chr10:43612033[C', '[chr10:61665880[G', 'C[chr20:3835205[', ']chr20:3870375]T', 'G]chr6:117650611]']
My goal is to turn it into a list containing only the string of the middle part 'chr_XX:number'. That means I need to find a way to remove either 1 or 2 characters from the beginning and the end of the original string.
['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611']
My problem here is, that I can not slice by index as the pattern is:
<chr>+ any number between 1-22 or X or Y
E.g. chr1 or chr22 or chrX or chrY
The part after the : can be any integer number spanning up to 9 digits.
Thus, I cannot just slice by removing the first x characters or the last x characters.
This is because sometimes I have 2 characters before my relevant string and sometimes only one.
As in:
<any_letter>]chr10:<the_number>
or
]chr10:<the_integer>
or the same story but with the opening square bracket [ .
The same goes for the final part of the string. After my famous integer of any length between 1 and 9 digits i got either ]<any_letter> or just a single ] or same pattern but with the opening square bracket.
Any elegant ideas?
As suggested in the comments, you could simply use regex by utilizing this pattern:
chr<digits>or<XY>:<digits>
Check out this if you want to learn more about regular expressions
Here's a working example:
import re

strings = [
    '[chr10:43612033[C',
    '[chr10:61665880[G',
    'C[chr20:3835205[',
    ']chr20:3870375]T',
    'G]chr6:117650611]',
    'G]chrX:117650611]',
    'G]chrY:117650611]',
]

print([re.search(r"chr(\d{1,2}?|[X-Y]):\d{,9}", s).group(0) for s in strings])
Output:
['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611', 'chrX:117650611', 'chrY:117650611']
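If, as in the sample data, the flanking letters are always nucleotide codes (an assumption on my part), str.strip can do the job without any regex, since strip removes leading/trailing characters from a set and stops at the first character not in it:

```python
variants = ['[chr10:43612033[C', '[chr10:61665880[G', 'C[chr20:3835205[',
            ']chr20:3870375]T', 'G]chr6:117650611]']
# strip brackets and single-letter bases (A, C, G, T) from both ends;
# 'chr...' survives because lowercase 'c' and the digits are not in the set
cleaned = [s.strip('[]ACGT') for s in variants]
print(cleaned)
```

This only holds while the flanking letters stay uppercase and outside the chr/number part; the regex solution above is the more robust choice.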

Extract words from a string

Sample Input:
'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
Expected Output:
['D3H6', 'X30G', 'Y2A', '12H89']
My code:
split_note = re.split(r'[.;,\s]\s*', note)
pattern = re.compile("^[a-zA-Z0-9]+$")
#if pattern.match(ini_str):
for a in split_note:
    if pattern.match(a):
        alphaList.append(a)
I need to extract all the alpha numeric words from a split string and store them in a list.
The above code is unable to give expected output.
Maybe this can solve the problem:
import re
# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# words tokenization
split = re.findall("[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",stri)
# this statement returns words containing both numbers and letters
print([word for word in split if bool(re.match('^(?=.*[a-zA-Z])(?=.*[0-9])', word))])
#output: ['D3H6', 'X30', 'Y2', '12H89']
^ and $ are meant for the beginning and end of a line, not of a word.
Besides, your example words don't include lower case, so why add a-z?
Considering your example, if what you need is to fetch a word that always contains both at least one letter and at least one number and always ends with a number, this is the pattern:
\b[0-9A-Z]+\d+\b
If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex:
\b[0-9A-Z]*\d|[A-Z][0-9A-Z]*\b
\b stands for a word boundary.
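Another way to express "contains at least one letter and at least one digit" is a pair of lookaheads anchored at the word boundary; a sketch against the question's sample input:

```python
import re

note = ('note - Part model D3H6 with specifications X30G and Y2A '
        'is having features 12H89.')
# \b(?=\w*\d)(?=\w*[A-Za-z])\w+\b : a word that has a digit somewhere
# and a letter somewhere, in either order
words = re.findall(r'\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b', note)
print(words)
```

Because the two lookaheads are independent, the order of letters and digits inside the word doesn't matter.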

Reg Ex for specific number in string

I'd like to match numbers (int and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40".
So far, I came up with
([0-9]*[.])?[0-9]+[^.*+=<>]
This correctly ignores x0, but also 0 or 0.5 (12.45, however, works). Changing the + to * leads to wrong matchings.
It would be very nice if someone could point out my error.
Thanks!
This is actually not simple. Float literals are more complex than you assumed, being able to contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
           'x5*1.1+42*y=40+a123-3.14e-2')
This returns:
['1.1', '+42', '40', '-3.14e-2']
You should consider though whether a thing like 4+3 should lead to ['4', '3'] or ['4', '-3']. If the input was 4+-3 the '-3' would clearly be preferable. But to distinguish these isn't easy and you should consider using a proper formula parser for these.
Maybe the standard module ast can help you. The expression must be a valid Python expression in this case, so a thing like a+b=40 isn't allowed because left of the equal sign is no proper lvalue. But for valid Python objects you could use ast like this:
import ast

def find_all_numbers(e):
    if isinstance(e, ast.BinOp):
        for r in find_all_numbers(e.left):
            yield r
        for r in find_all_numbers(e.right):
            yield r
    elif isinstance(e, ast.Num):
        yield e.n

list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))
Returns:
[1.1, 42, 40]
You could do it with something like
\b\d*(\.\d+)?\b
It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non word character. And since both digits and (english) letters are word characters, it won't match the 5 in a sequence like x5.
See this regex101 example.
The main reason your try fails is that it ends with [^.*+=<>], which requires the number (or rather the match) to end with a character other than ., *, =, +, < or >. When the match ends with a single digit, as in 0 and 0.5, the digit gets eaten by the [0-9]+, there's nothing left for the [^.*+=<>] to match, and thus it fails. In the case of 12.45 it first matches 12.4 and then the [^.*+=<>] matches the 5.
Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)
It is using negative lookbehind in order not to select anything having [a-zA-Z_] prior to it.
Check it out here in Regex101.
About your regex ([0-9]*[.])?[0-9]+[^.*+=<>] use [0-9]+ instead of [0-9]* as it will not allow .05 to be captured, only 0.5. Another thing is [^.*+=<>] this part, you could add ? to the end of it in order to allow it not to have characters as well. Example 1.1 wont be captured as ([0-9]*[.])?[0-9]+ is satisfied but not [^.*+=<>] that comes after it as well.
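One caveat on the lookbehind idea: the excluded class also needs digits, otherwise something like x123 would still yield a match starting at its second digit. A tested sketch:

```python
import re

expr = 'x5*1.1+42*y=40'
# exclude letters, underscore AND digits before the match start,
# so no match can begin in the middle of an identifier or a number
nums = re.findall(r'(?<![A-Za-z0-9_])\d+(?:\.\d+)?', expr)
print(nums)  # ['1.1', '42', '40']
```

The x5 is skipped because the 5 is preceded by a letter, while digits after operators like *, + and = match normally.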

Can't convert 'list' object to str implicitly - Python

I am trying to import the alphabet but split it so that each character is in one array but not one string. splitting it works but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go around solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).
There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string

year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum

# Length of characters in the ascii letters alphabet:
# total_sum == 12
# Length of all characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = (1 - distance / len(correct)) * 10  # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
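If installing a third-party package is not an option, the standard library's difflib gives a similar (though not identical) similarity score; a sketch under that assumption:

```python
import difflib

correct = 'because'
student = 'becuase'
# ratio() returns a similarity in [0, 1] based on matching blocks;
# it is not the true edit distance, but close enough for grading
ratio = difflib.SequenceMatcher(None, correct, student).ratio()
mark = round(ratio * 10, 2)
print(mark)
```

The numbers won't match Levenshtein exactly (transpositions are scored differently), so pick one measure and apply it consistently to all students.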
I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
join is a method of str; you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)
To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))
I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])
import string

# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)

def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2
While join creates the string from the split, you would not have to do that, as you can issue the find on the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting tries to locate splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (-1 result).
If what you are trying to do is to get the indices of all the letters of the word list within the alphabet, then you would need to handle it as
for each letter in the word of the word list, determine the index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print j
On the other hand if you are attempting to find the index of each character within the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print l
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()

How to find and count emoticons in a string using python?

This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following unicode information contains just such emoticons: pdf.
Using a string with english words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.
The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:
$ cat <file containing the strings with emoticons> | ./emo.py
emo.py pseudo-script:
import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii", "replace")
    #insert regex to find the emoticons
    if match:
        #do some counting using .split(" ")
        #print the counting
The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:
"Smiley emoticon rocks! I like you."
The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.
First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.
A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:
import re

s = u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(u'[\U0001f600-\U0001f650]', s))
Or, if the string is big enough that building up the whole findall list seems wasteful:
emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)
Counting words, you can do separately:
wordcount = len(s.split())
If you want to do it all at once, you can use an alternation group:
word_and_emoticon_count = len(re.findall(u'\w+|[\U0001f600-\U0001f650]', s))
As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for characters above that range, but that's OK, because they don't exist to be searched for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.
Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:
(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)
You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])
Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:
\ud83d[\ude00-\ude50]
One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
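The lead/trail code units used in these patterns can be derived mechanically; a small helper illustrating the arithmetic (the function name is mine):

```python
def to_surrogate_pair(cp):
    # split a supplementary code point (>= U+10000) into UTF-16 lead/trail units
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

lead, trail = to_surrogate_pair(0x1F600)
# U+1F600 -> D83D DE00, matching the \ud83d[\ude00-\ude50] pattern above
print(hex(lead), hex(trail))
```

Running it on both ends of a range gives you the bounds to plug into the surrogate-pair regex.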
My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 once, although it consists of 4 emojis.
import emoji
import regex

def split_count(text):
    emoji_counter = 0
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # remove the emoji from the text so it is not counted as a word
            text = text.replace(word, '')
    words_counter = len(text.split())
    return emoji_counter, words_counter
Testing:
line = "hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))
Output:
Number of emojis - 5, number of words - 7
If you are trying to read unicode characters outside the ascii range, don't convert into the ascii range. Just leave it as unicode and work from there (untested):
import sys

count = 0
emoticons = set(range(int('1f600', 16), int('1f650', 16)))

for row in sys.stdin:
    for char in row:
        if ord(char) in emoticons:
            count += 1

print "%d emoticons found" % count
Not the best solution, but it should work.
This is my solution using re:
import re
text = "your text with emojis"
em_count = len(re.findall(r'[^\w\s,.]', text))
print(em_count)
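On a modern wide build (Python 3.3+), the code-point-range idea from the accepted answer works directly; a self-contained sketch combining the word and emoji counts (it covers only the Emoticons block U+1F600-U+1F64F, so other emoji blocks would need extra ranges):

```python
import re

def count_words_and_emoji(s):
    # emoji in the Emoticons block only; \w+ skips punctuation and emoji
    emoji = re.findall('[\U0001F600-\U0001F64F]', s)
    words = re.findall(r'\w+', s)
    return len(words), len(emoji)

print(count_words_and_emoji('Smiley emoticon rocks!\U0001F600 I like you.\U0001F601'))
```

For the sample sentence this reports 6 words and 2 emoji, even though the emoji sit flush against the words.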
