How do I count letters in a string?


Say I have a str containing non-ASCII text, for example:
my_str = "नमस्ते" # ['न', 'म', 'स', '्', 'त', 'े']
How do I find how many letters it contains? len(my_str) returns 6, which is the number of Unicode code points it contains. It's actually 4 letters long.
Bonus question: some languages define digraphs as a single letter (for example, "Dh" is the 6th letter of the modern Albanian alphabet). How can I handle that edge case?

You want to segment text. In Unicode, text segmentation is governed by UAX #29.
Regarding "4 letters long": that terminology is too narrow; it should say "4 grapheme clusters long".
Use the uniseg library:
from uniseg.graphemecluster import grapheme_clusters
for text in ('नमस्ते', 'Bo\u0304ris', 'Ꙝ̛͖͋҉ᄀᄀᄀ각ᆨᆨ'):
    print(list(grapheme_clusters(text)))
#['न', 'म', 'स्', 'ते']
#['B', 'ō', 'r', 'i', 's']
#['Ꙝ̛͋', 'ᄀᄀᄀ각ᆨᆨ']
# treat digraph 'dh' as a customised grapheme cluster
def albanian_digraph_dh(s, breakables):
    for i, breakable in enumerate(breakables):
        # suppress the break between a 'd' and the 'h' that follows it
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable
# you can do all the digraphs like this
ALBANIAN_DIGRAPHS = {"Dh", "Gj", "Ll", "Nj", "Rr", "Sh", "Th", "Xh", "Zh"}
ALBANIAN_DIGRAPHS |= {digraph.lower() for digraph in ALBANIAN_DIGRAPHS}
def albanian_digraphs(s, breakables):
    for i, breakable in enumerate(breakables):
        yield 0 if s[i-1:i+1] in ALBANIAN_DIGRAPHS else breakable
# from https://sq.wiktionary.org/wiki/Speciale:PrefixIndex?prefix=dh
for text in ('dhallanik', 'dhelpëror', 'dhembshurisht', 'dhevështrues', 'dhimbshëm', 'dhjamosje', 'dhjetëballësh', 'dhjetëminutësh', 'dhogaç', 'dhogiç', 'dhomë-muze', 'dhuratë', 'dhëmbinxhi', 'dhëmbçoj', 'dhëmbëkatarosh'):
    print(list(grapheme_clusters(text, albanian_digraphs)))
#['dh', 'a', 'll', 'a', 'n', 'i', 'k']
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']
#['dh', 'e', 'm', 'b', 'sh', 'u', 'r', 'i', 'sh', 't']
#['dh', 'e', 'v', 'ë', 'sh', 't', 'r', 'u', 'e', 's']
#['dh', 'i', 'm', 'b', 'sh', 'ë', 'm']
#['dh', 'j', 'a', 'm', 'o', 's', 'j', 'e']
#['dh', 'j', 'e', 't', 'ë', 'b', 'a', 'll', 'ë', 'sh']
#['dh', 'j', 'e', 't', 'ë', 'm', 'i', 'n', 'u', 't', 'ë', 'sh']
#['dh', 'o', 'g', 'a', 'ç']
#['dh', 'o', 'g', 'i', 'ç']
#['dh', 'o', 'm', 'ë', '-', 'm', 'u', 'z', 'e']
#['dh', 'u', 'r', 'a', 't', 'ë']
#['dh', 'ë', 'm', 'b', 'i', 'n', 'xh', 'i']
#['dh', 'ë', 'm', 'b', 'ç', 'o', 'j']
#['dh', 'ë', 'm', 'b', 'ë', 'k', 'a', 't', 'a', 'r', 'o', 'sh']
You can install it with
pip install uniseg
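If installing a third-party package is not an option, a rough approximation is possible with the standard library alone. The sketch below counts code points that are not combining marks; the helper name count_base_characters is made up for illustration. It is not an implementation of UAX #29 and will miscount cases such as emoji sequences or conjoining Hangul jamo, so treat it as an illustration of why len() overcounts rather than as a replacement for uniseg.

```python
import unicodedata

def count_base_characters(text):
    """Count code points whose general category is not a mark (Mn, Mc, Me).

    Hypothetical helper: a crude stand-in for grapheme cluster counting,
    NOT a UAX #29 implementation.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

# The Devanagari example from the question, written with escapes
namaste = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(len(namaste))                          # 6 code points
print(count_base_characters(namaste))        # 4
print(count_base_characters("Bo\u0304ris"))  # 5
```

For anything beyond simple base-plus-combining-mark text, a real segmentation library remains the safer choice.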

Related

Python [x::y] slice operator - why doesn't it work for me?

I have a list like this:
residL=['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A']
Desired output:
residL = ['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A\n10', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H\n20', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G\n30', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L\n40', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A\n50']
I can get this output with this piece of code:
for i in range(9,len(residL), 10):
    residL[i] = '%s\n%i'%(residL[i], i+1)
But I wanted to go fancy, so I tried the slice operator:
residL[9::10] = [x+'\n%i'%(residL.index(x)+1) for x in residL[9::10]]
I got a strange result though:
residL = ['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A\n10', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H\n20', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G\n7', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L\n5', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A\n10']
I'm wondering how it could be fixed. Just for the sake of learning. :)

index returns the position of the first occurrence of a value, so for a repeated letter it finds an earlier appearance instead of the one in your slice. Instead, use enumerate to track the index yourself.
residL=['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A']
residL[9::10] = [x+'\n%i'%((i+1)*10) for i, x in enumerate(residL[9::10])]
residL
# => ['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A\n10', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H\n20', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G\n30', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L\n40', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A\n50']
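To see the difference on a smaller scale, here is a minimal sketch with a made-up six-item list. The helper name label_every is hypothetical; it computes each position from the loop index rather than looking the value up with index.

```python
def label_every(items, step=10):
    """Return a copy of items with a newline and the 1-based position
    appended to every step-th item (hypothetical helper)."""
    labelled = list(items)
    for pos in range(step - 1, len(labelled), step):
        labelled[pos] = f"{labelled[pos]}\n{pos + 1}"
    return labelled

letters = ['G', 'A', 'G', 'A', 'G', 'A']

# list.index always reports the first occurrence, whatever position we meant
print(letters.index('A'))  # 1

# Positions derived from the loop index are unaffected by repeated values
result = label_every(letters, step=2)
print(result)  # ['G', 'A\n2', 'G', 'A\n4', 'G', 'A\n6']
```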

Use enumerate to keep track of the index:
>>> [x if (i-9)%10 else x+f'\n{i+1}' for i,x in enumerate(residL)]
['M', 'P', 'P', 'M', 'L', 'S', 'G', 'L', 'L', 'A\n10', 'R', 'L', 'V', 'K', 'L', 'L', 'L', 'G', 'R', 'H\n20', 'G', 'S', 'A', 'L', 'H', 'W', 'R', 'A', 'A', 'G\n30', 'A', 'A', 'T', 'V', 'L', 'L', 'V', 'I', 'V', 'L\n40', 'L', 'A', 'G', 'S', 'Y', 'L', 'A', 'V', 'L', 'A\n50']

Python: Alphabet array sort

I was trying a sample exercise on regexes: find all the letters of the alphabet in a string, sort the array, and finally eliminate all repetitions.
>>> import re
>>> letterRegex = re.compile(r'[a-z]')
>>> alphabets = letterRegex.findall("The quick brown fox jumped over the lazy dog")
>>> alphabets.sort()
>>> alphabets
['a', 'b', 'c', 'd', 'd', 'e', 'e', 'e', 'e', 'f', 'g', 'h', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'o', 'o', 'o', 'p', 'q', 'r', 'r', 't', 'u', 'u', 'v', 'w', 'x', 'y', 'z']
After doing the sort I tried to make a loop that'll eliminate all repetitions in the array.
e.g. [..., 'e', 'e', ...]
So I did this:
>>> i, j = -1, 0
>>> for items in range(len(alphabets)):
...     if alphabets[i+1] == alphabets[j+1]:
...         alphabets.remove(alphabets[j])
However, it didn't work. How can I remove repetitions?

Here's a much easier way of removing consecutive duplicates (in a sorted list, that means all duplicates):
import itertools
L = ['a', 'b', 'c', 'd', 'd', 'e', 'e', 'e', 'e', 'f', 'g', 'h', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'o', 'o', 'o', 'p', 'q', 'r', 'r', 't', 'u', 'u', 'v', 'w', 'x', 'y', 'z']
answer = []
for k,_group in itertools.groupby(L):
    answer.append(k)
Or simpler still:
answer = [k for k,_g in itertools.groupby(L)]
Both yield this:
In [42]: print(answer)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 't', 'u', 'v', 'w', 'x', 'y', 'z']
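Since the goal is a sorted list of distinct letters, another option is to skip the explicit sort-then-deduplicate sequence and go through a set. This is a sketch of that alternative, not the approach the answer above uses; it relies only on the standard library.

```python
import re

text = "The quick brown fox jumped over the lazy dog"

# set() drops duplicates; sorted() returns a new, ordered list
unique_letters = sorted(set(re.findall(r'[a-z]', text)))
print(unique_letters)
```

Unlike itertools.groupby, this does not require the input to be sorted first, because the set removes duplicates wherever they occur.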

split long string inside one list to small lists

How do you split this one long list into smaller lists, as shown in the output below? (My file has 100 lines.)
Num=['S', 'I', 'R', 'T', 'S', 'A', 'V', 'P', 'S', 'P', 'C', 'G', 'K', 'Y', 'Y', 'T', 'L', 'N', 'G', 'S', 'K', '\n', ',', 'S', 'T', 'P', 'C', 'T', 'T', 'I', 'N', 'K', 'V', 'K', 'A', 'S', 'G', 'M', 'K', 'A', 'I', 'M', 'M', 'A', '\n']
Output should look like this:
['S', 'I', 'R', 'T', 'S', 'A', 'V', 'P', 'S', 'P', 'C', 'G', 'K', 'Y', 'Y', 'T', 'L', 'N', 'G', 'S', 'K']
['S', 'T', 'P', 'C', 'T', 'T', 'I', 'N', 'K', 'V', 'K', 'A', 'S', 'G', 'M', 'K', 'A', 'I', 'M', 'M', 'A']

First join the elements into one string, strip() the leading and trailing whitespace, split on the newline-plus-comma separator '\n,', and then map each piece back to a list.
In short:
l1, l2 = map(list, "".join(Num).strip().split('\n,'))
Now l1 and l2 are, respectively:
['S', 'I', 'R', 'T', 'S', 'A', 'V', 'P', 'S', 'P', 'C', 'G', 'K', 'Y', 'Y', 'T', 'L', 'N', 'G', 'S', 'K']
and
['S', 'T', 'P', 'C', 'T', 'T', 'I', 'N', 'K', 'V', 'K', 'A', 'S', 'G', 'M', 'K', 'A', 'I', 'M', 'M', 'A']
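Unpacking into l1, l2 only works when there are exactly two rows. For a file with many lines, a sketch along the following lines may generalise better. The function name split_rows is made up, and it assumes every row ends in '\n' and that a comma only ever appears as the separator at the start of a row.

```python
def split_rows(chars):
    """Split a flat list of characters into one list per line.

    Hypothetical helper; assumes ',' is only a row separator, never data.
    """
    text = "".join(chars)
    rows = []
    for line in text.splitlines():
        line = line.strip(',')  # drop the separator comma at the row start
        if line:                # skip rows that end up empty
            rows.append(list(line))
    return rows

sample = ['A', 'B', '\n', ',', 'C', 'D', '\n', ',', 'E', 'F', '\n']
print(split_rows(sample))  # [['A', 'B'], ['C', 'D'], ['E', 'F']]
```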

PyEnchant weird behavior for numbers

I am using PyEnchant for some spelling/grammar correction scripting. I have noticed this behavior on my Mac:
>>> import enchant
>>> d = enchant.Dict('en_us')
>>> d.suggest('50')
['W', 'Y', 'w', 'y', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'x', 'z']
>>> enchant.__version__
'1.6.6'
However, it works more predictably on my Linux machine (same version of pyenchant):
>>> import enchant
>>> d = enchant.Dict('en_us')
>>> d.suggest('50')
['5', '0', '50s']

It is due to the underlying provider. On Ubuntu I have an en_US dictionary installed for both myspell and aspell. If I switch providers I get different results. E.g. with a script like this:
import enchant
# prefer myspell over aspell for en_US
b = enchant.Broker()
b.set_ordering("en_US", "myspell,aspell")
print(b.describe())
d = b.request_dict("en_US")
print(d.provider)
s = '50'
print(d.suggest(s))
# prefer aspell over myspell for en_US
b = enchant.Broker()
b.set_ordering("en_US", "aspell,myspell")
print(b.describe())
d = b.request_dict("en_US")
print(d.provider)
s = '50'
print(d.suggest(s))
I get the following output.
[<Enchant: Aspell Provider>, <Enchant: Ispell Provider>, <Enchant: Myspell Provider>, <Enchant: Hspell Provider>]
<Enchant: Myspell Provider>
['5', '0', '50s']
[<Enchant: Aspell Provider>, <Enchant: Ispell Provider>, <Enchant: Myspell Provider>, <Enchant: Hspell Provider>]
<Enchant: Aspell Provider>
['W', 'Y', 'w', 'y', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'x', 'z']
The first set of suggestions matches what you see on Linux; it comes from the Myspell provider. The second matches what you see on your Mac; it comes from the Aspell provider.

Generate a matrix from a given string

I need to create a matrix from a string s, where m is the given number of rows and len(s)/m is the number of columns. The first column must be filled with the first m chars of s (i.e. the chars at 0*m+i for every i in range(m)), the second column with the chars at 1*m+i, and so on.
What's the best way to do this in Python?
EDIT:
This is the code I have written so far.
def split_by_n(seq, n):
    """A generator to divide a sequence into chunks of n units."""
    while seq:
        yield seq[:n]
        seq = seq[n:]
#print(list(split_by_n("1234567890",2)))
input=list("ZPFKYLGJPNSGNMQGFGCITLVRIWMGFBLBFDSIOAJGBGAVFVHBGLFSRPNIOFSYOBTFCGRQLWWZAAJFUPGAFZSNXLTGARUVFKOLGAIWGUUCMVSEKLIAGJGGUZFBAOILVRIZPORNXWVFRGNMEGCEUNUZSPNIUAHFRQLWALHWEQGQKDFDCCKLUZWFSITKWIKLSMUQKNJUWRTKZAHJGABKDEGEMNCVIMBFRNYXSSKYPWLWHUKKISHFAJPOOFGJBJTBXXSGTRYAJGBNRMYHOGXQBLSFEWVUCHRLEJWAQBIWFRLWSSKRKSBFRAKDFJVRGZUOCJUZEKWAPIQSBRYM")
l = list(split_by_n(input,6))
for i in range(len(l[-2])-len(l[-1])):
    l[-1].append('$')
print(l)

I gather from your comment that you want the transpose of the matrix formed from the given string. Your code already builds the matrix just fine; I have tweaked it only slightly and added code for the transpose.
def split_by_n(seq, n):
    """Yield successive chunks of n items from seq."""
    while seq:
        yield seq[:n]
        seq = seq[n:]
def make_matrix(string):
    col_count = 6
    matrix = list(split_by_n(string, col_count))
    # the last row has "less_by" fewer elements than the rest of the rows
    less_by = len(matrix[-2]) - len(matrix[-1])
    matrix[-1] += '$' * less_by
    return matrix
def make_transpose(matrix):
    col_count = len(matrix[0])
    transpose = []
    for i in range(col_count):
        transpose.append([row[i] for row in matrix])
    return transpose
string = list("ZPFKYLGJPNSGNMQGFGCITLVRIWMGFBLBFDSIOAJGBGAVFVHBGLFSRPNIOFSYOBTFCGRQLWWZAAJFUPGAFZSNXLTGARUVFKOLGAIWGUUCMVSEKLIAGJGGUZFBAOILVRIZPORNXWVFRGNMEGCEUNUZSPNIUAHFRQLWALHWEQGQKDFDCCKLUZWFSITKWIKLSMUQKNJUWRTKZAHJGABKDEGEMNCVIMBFRNYXSSKYPWLWHUKKISHFAJPOOFGJBJTBXXSGTRYAJGBNRMYHOGXQBLSFEWVUCHRLEJWAQBIWFRLWSSKRKSBFRAKDFJVRGZUOCJUZEKWAPIQSBRYM")
matrix = make_matrix(string)
transpose = make_transpose(matrix)
for e in matrix:
    print(e)
print('\nThe transpose:')
for e in transpose:
    print(e)
Output:
['Z', 'P', 'F', 'K', 'Y', 'L']
['G', 'J', 'P', 'N', 'S', 'G']
['N', 'M', 'Q', 'G', 'F', 'G']
['C', 'I', 'T', 'L', 'V', 'R']
['I', 'W', 'M', 'G', 'F', 'B']
['L', 'B', 'F', 'D', 'S', 'I']
['O', 'A', 'J', 'G', 'B', 'G']
['A', 'V', 'F', 'V', 'H', 'B']
['G', 'L', 'F', 'S', 'R', 'P']
['N', 'I', 'O', 'F', 'S', 'Y']
['O', 'B', 'T', 'F', 'C', 'G']
['R', 'Q', 'L', 'W', 'W', 'Z']
['A', 'A', 'J', 'F', 'U', 'P']
['G', 'A', 'F', 'Z', 'S', 'N']
['X', 'L', 'T', 'G', 'A', 'R']
['U', 'V', 'F', 'K', 'O', 'L']
['G', 'A', 'I', 'W', 'G', 'U']
['U', 'C', 'M', 'V', 'S', 'E']
['K', 'L', 'I', 'A', 'G', 'J']
['G', 'G', 'U', 'Z', 'F', 'B']
['A', 'O', 'I', 'L', 'V', 'R']
['I', 'Z', 'P', 'O', 'R', 'N']
['X', 'W', 'V', 'F', 'R', 'G']
['N', 'M', 'E', 'G', 'C', 'E']
['U', 'N', 'U', 'Z', 'S', 'P']
['N', 'I', 'U', 'A', 'H', 'F']
['R', 'Q', 'L', 'W', 'A', 'L']
['H', 'W', 'E', 'Q', 'G', 'Q']
['K', 'D', 'F', 'D', 'C', 'C']
['K', 'L', 'U', 'Z', 'W', 'F']
['S', 'I', 'T', 'K', 'W', 'I']
['K', 'L', 'S', 'M', 'U', 'Q']
['K', 'N', 'J', 'U', 'W', 'R']
['T', 'K', 'Z', 'A', 'H', 'J']
['G', 'A', 'B', 'K', 'D', 'E']
['G', 'E', 'M', 'N', 'C', 'V']
['I', 'M', 'B', 'F', 'R', 'N']
['Y', 'X', 'S', 'S', 'K', 'Y']
['P', 'W', 'L', 'W', 'H', 'U']
['K', 'K', 'I', 'S', 'H', 'F']
['A', 'J', 'P', 'O', 'O', 'F']
['G', 'J', 'B', 'J', 'T', 'B']
['X', 'X', 'S', 'G', 'T', 'R']
['Y', 'A', 'J', 'G', 'B', 'N']
['R', 'M', 'Y', 'H', 'O', 'G']
['X', 'Q', 'B', 'L', 'S', 'F']
['E', 'W', 'V', 'U', 'C', 'H']
['R', 'L', 'E', 'J', 'W', 'A']
['Q', 'B', 'I', 'W', 'F', 'R']
['L', 'W', 'S', 'S', 'K', 'R']
['K', 'S', 'B', 'F', 'R', 'A']
['K', 'D', 'F', 'J', 'V', 'R']
['G', 'Z', 'U', 'O', 'C', 'J']
['U', 'Z', 'E', 'K', 'W', 'A']
['P', 'I', 'Q', 'S', 'B', 'R']
['Y', 'M', '$', '$', '$', '$']
The transpose:
['Z', 'G', 'N', 'C', 'I', 'L', 'O', 'A', 'G', 'N', 'O', 'R', 'A', 'G', 'X', 'U', 'G', 'U', 'K', 'G', 'A', 'I', 'X', 'N', 'U', 'N', 'R', 'H', 'K', 'K', 'S', 'K', 'K', 'T', 'G', 'G', 'I', 'Y', 'P', 'K', 'A', 'G', 'X', 'Y', 'R', 'X', 'E', 'R', 'Q', 'L', 'K', 'K', 'G', 'U', 'P', 'Y']
['P', 'J', 'M', 'I', 'W', 'B', 'A', 'V', 'L', 'I', 'B', 'Q', 'A', 'A', 'L', 'V', 'A', 'C', 'L', 'G', 'O', 'Z', 'W', 'M', 'N', 'I', 'Q', 'W', 'D', 'L', 'I', 'L', 'N', 'K', 'A', 'E', 'M', 'X', 'W', 'K', 'J', 'J', 'X', 'A', 'M', 'Q', 'W', 'L', 'B', 'W', 'S', 'D', 'Z', 'Z', 'I', 'M']
['F', 'P', 'Q', 'T', 'M', 'F', 'J', 'F', 'F', 'O', 'T', 'L', 'J', 'F', 'T', 'F', 'I', 'M', 'I', 'U', 'I', 'P', 'V', 'E', 'U', 'U', 'L', 'E', 'F', 'U', 'T', 'S', 'J', 'Z', 'B', 'M', 'B', 'S', 'L', 'I', 'P', 'B', 'S', 'J', 'Y', 'B', 'V', 'E', 'I', 'S', 'B', 'F', 'U', 'E', 'Q', '$']
['K', 'N', 'G', 'L', 'G', 'D', 'G', 'V', 'S', 'F', 'F', 'W', 'F', 'Z', 'G', 'K', 'W', 'V', 'A', 'Z', 'L', 'O', 'F', 'G', 'Z', 'A', 'W', 'Q', 'D', 'Z', 'K', 'M', 'U', 'A', 'K', 'N', 'F', 'S', 'W', 'S', 'O', 'J', 'G', 'G', 'H', 'L', 'U', 'J', 'W', 'S', 'F', 'J', 'O', 'K', 'S', '$']
['Y', 'S', 'F', 'V', 'F', 'S', 'B', 'H', 'R', 'S', 'C', 'W', 'U', 'S', 'A', 'O', 'G', 'S', 'G', 'F', 'V', 'R', 'R', 'C', 'S', 'H', 'A', 'G', 'C', 'W', 'W', 'U', 'W', 'H', 'D', 'C', 'R', 'K', 'H', 'H', 'O', 'T', 'T', 'B', 'O', 'S', 'C', 'W', 'F', 'K', 'R', 'V', 'C', 'W', 'B', '$']
['L', 'G', 'G', 'R', 'B', 'I', 'G', 'B', 'P', 'Y', 'G', 'Z', 'P', 'N', 'R', 'L', 'U', 'E', 'J', 'B', 'R', 'N', 'G', 'E', 'P', 'F', 'L', 'Q', 'C', 'F', 'I', 'Q', 'R', 'J', 'E', 'V', 'N', 'Y', 'U', 'F', 'F', 'B', 'R', 'N', 'G', 'F', 'H', 'A', 'R', 'R', 'A', 'R', 'J', 'A', 'R', '$']
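The chunk-then-transpose steps can also be folded into one function with itertools.zip_longest, which pads the short final chunk while transposing. This is a sketch under assumptions rather than the answer's own method: the name column_major_matrix and the fill parameter are invented here, and m is taken to be the number of rows as in the question.

```python
from itertools import zip_longest

def column_major_matrix(s, m, fill='$'):
    """Build an m-row matrix whose columns are consecutive m-char chunks of s.

    Hypothetical helper: entry [i][j] is s[j*m + i], with `fill` used
    where the last column runs out of characters.
    """
    # Each chunk of m characters becomes one column
    chunks = [s[k:k + m] for k in range(0, len(s), m)]
    # zip_longest transposes columns into rows and pads the short chunk
    return [list(row) for row in zip_longest(*chunks, fillvalue=fill)]

for row in column_major_matrix("ABCDEFGH", 3):
    print(row)
# ['A', 'D', 'G']
# ['B', 'E', 'H']
# ['C', 'F', '$']
```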
