How to find and count emoticons in a string using Python?

This topic has been addressed for text-based emoticons at link1, link2, and link3. However, I would like to do something slightly different from matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following Unicode information covers just such emoticons: pdf.
Given a string of English words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.
The direction I was heading doesn't seem to be the best option, and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:
$cat <file containing the strings with emoticons> | ./emo.py
emo.py pseudo-script:
import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii", "replace")
    # insert regex to find the emoticons
    if match:
        # do some counting using .split(" ")
        # print the counting
The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:
"Smiley emoticon rocks! I like you."
The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.

First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.
A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:
import re

s = u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(u'[\U0001f600-\U0001f650]', s))
Or, if the string is big enough that building up the whole findall list seems wasteful:
emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)
Counting words, you can do separately:
wordcount = len(s.split())
If you want to do it all at once, you can use an alternation group:
word_and_emoticon_count = len(re.findall(u'\w+|[\U0001f600-\U0001f650]', s))
As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds, and most CPython Windows builds, for example, are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for characters above that range, but that's OK, because they don't exist to be searched for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.
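In code, that fallback might look like this (a minimal sketch, assuming Python 2, where narrow builds exist):
import re

try:
    # Wide build: astral characters are single code points, so this compiles.
    emoticon_re = re.compile(u'[\U0001f600-\U0001f650]')
except re.error:
    # Narrow build: the \U escapes become surrogate pairs, the class turns
    # into an invalid range, and compilation fails; fall back to a pattern
    # that never matches.
    emoticon_re = re.compile(u'(?!x)x')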
Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:
(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)
You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
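If you'd rather compute those code units than look them up, a small helper (my own sketch, using the standard UTF-16 arithmetic) does it:
def to_surrogate_pair(cp):
    # Split an astral code point into its UTF-16 lead and trail code units.
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

# For the original range:
# to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
# to_surrogate_pair(0x1F650) == (0xD83D, 0xDE50)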
If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])
Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:
\ud83d[\ude00-\ude50]
One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)

My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 once, although it consists of 4 emojis.
import emoji
import regex

def split_count(text):
    emoji_counter = 0
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # Remove the emoji from the given text
            text = text.replace(word, '')
    words_counter = len(text.split())
    return emoji_counter, words_counter
Testing:
line = "hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))
Output:
Number of emojis - 5, number of words - 7

If you are trying to read Unicode characters outside the ASCII range, don't convert into the ASCII range. Just leave the text as Unicode and work from there (untested):
import sys

count = 0
emoticons = set(range(int('1f600', 16), int('1f650', 16)))
for row in sys.stdin:
    for char in row:
        if ord(char) in emoticons:
            count += 1
print "%d emoticons found" % count
Not the best solution, but it should work.

This is my solution using re:
import re
text = "your text with emojis"
em_count = len(re.findall(r'[^\w\s,.]', text))
print(em_count)

Related

How do I find an index in a string without using find() in Python

I'm currently using the find function and have found a slight problem.
theres gonna be a fire here
If I have a sentence containing both "here" and "theres" and I use find() to get the index of "here", I instead match the "here" inside "theres".
I thought find() would be like
if thisword in thatword:
as it would find the word, not a substring within a string.
Is there another function that may work similarly? I'm using find() quite heavily and would like to know of alternatives before I clog the code with string.split() and then iterate until I find the exact match with an index counter on the side.
MainLine = str('theres gonna be a fire here')
WordtoFind = str('here')
#String_Len = MainLine.find(WordtoFind)
split_line = MainLine.split()
indexCounter = 0
for i in range(0, len(split_line)):
    indexCounter += (len(split_line[i]) + 1)
    if WordtoFind in split_line[i]:
        #String_Len = MainLine.find(split_line[i])
        String_Len = indexCounter
        break
The best route would be regular expressions. To find a "word" just make sure that the leading and ending characters are not alphanumeric. It uses no splits, has no exposed loops, and even works when you run into a weird sentence like "There is a fire,here". A find_word function might look like this
import re

def find_word_start(word, string):
    pattern = "(?<![a-zA-Z0-9])" + word + "(?![a-zA-Z0-9])"
    result = re.search(pattern, string)
    return result.start()
>> find_word_start("here", "There is a fire,here")
>> 16
The regex I made uses a trick called lookarounds, which make sure that the characters preceding and following the word are not letters or digits: https://www.regular-expressions.info/lookaround.html. The term [a-zA-Z0-9] is a character class that matches a single character from the ranges a-z, A-Z, and 0-9. Look up the Python re module to find out more about regular expressions.
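Applied to the question's own sentence, it skips the "here" inside "theres" and finds the standalone word:
>> find_word_start("here", "theres gonna be a fire here")
>> 23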

Separating Hebrew and English strings

So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string

data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S.: I'd prefer to do it within a Python program, but a grep command that does the same would also help.
To check if a string contains any ASCII letters (i.e. non-Hebrew), use:
re.search('[' + string.ascii_letters + ']', s)
If this returns true, your string is not pure Hebrew.
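Applying that test to the whole list might look like this (a sketch; precompiling the pattern helps on a huge list):
import re
import string

ascii_letter = re.compile('[' + string.ascii_letters + ']')
data = [s for s in slist if not ascii_letter.search(s)]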
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed, this picks out the English-only strings, but you need it the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive Unicode support, but it depends on what you're asking for. Is a Hebrew word one that contains only Hebrew characters and whitespace, or is it simply a word that contains no Latin characters? Either way, you can test for it directly: just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iterating through string.ascii_letters.
Please note that I do not speak Hebrew, so I may have missed a letter or two of the alphabet.
import string

def is_hebrew(word):
    hebrew = set("אבגדהוזחטיכךלמנסעפצקרשתםןףץ" + string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True

def contains_latin(word):
    # a generator expression like this is a terser way of expressing
    # the membership test above
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())

hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most of the Hebrew words from somewhere. It's probably possible to find such a list on the web in CSV or some other form; parse it and put it into a Python dictionary.
However, this makes sense only if you need to parse such lists of words often and quickly. Another problem is that the dictionary may not contain all Hebrew words, which would give an incomplete answer.
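A minimal sketch of that approach, assuming you have already sourced a word list into a file (the filename is hypothetical):
# hebrew_words.txt is a hypothetical one-word-per-line list you supply;
# a set gives the fast membership test this approach relies on.
with open('hebrew_words.txt') as f:
    hebrew_words = set(line.strip() for line in f)

data = [s for s in slist if s in hebrew_words]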
Try this:
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$', x), slist)

Can't convert 'list' object to str implicitly in Python

I am trying to import the alphabet, but split so that each character is an element of a list rather than one string. Splitting it works, but when I try to use it to find how many characters are in an inputted word, I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go about solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).
There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string

year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum

# Length of characters in the ascii letters alphabet:
# total_sum == 12
# Length of all characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division

import Levenshtein

correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student)  # distance == 2
mark = (1 - distance / len(correct)) * 10          # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
join is a method of str, so you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)
To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))
I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])
import string

# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)

def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2
While join creates the string from the split, you would not have to do that, since you can issue the find on the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting tries to find splitalphabet (a list) within year3wordlist[x], which will always fail; find() expects a string, which is the source of the TypeError.
If what you are trying to do is get the indices of all the letters of the words in the word list within the alphabet, then you would need to handle it as: for each letter in the word, determine its index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print j
On the other hand, if you are attempting to find the index of each character of the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is:
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print l
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this, as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ASCII characters in it and you want to count the number of ASCII characters in that word? I think you need to be more specific about what you want to determine.
If what you are doing is attempting to determine whether the characters are all alphabetic, then all you need is the isalpha() method: you can either call word.isalpha() and get True or False, or check each character of word with isalpha().
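For example:
>>> "because".isalpha()
True
>>> "becuase7".isalpha()
False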

Right justify string containing Thai characters

I would like to right-justify strings containing Thai characters (Thai rendering doesn't simply run left to right; marks can also stack above and below the line).
For example, for the strings ไป (two characters, display width 2) and ซื้อ (four characters, display width 2), I want to have the following output (width 5):
...ไป
...ซื้อ
The naive
print 'ไป'.decode('utf-8').rjust(5)
print 'ซื้อ'.decode('utf-8').rjust(5)
however, respectively produce
...ไป
.ซื้อ
Any ideas how to get to the desired formatting?
EDIT:
Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word above gives len(tc) = 4, but uses only 2 places/positions.
Cause
Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).
For example, in the word ซื้อ:
the first character is the initial consonant "SO SO",
then it has vowel mark SARA UUE,
then tone mark MAI THO,
and then the final pseudo-consonant O ANG
The problem is that characters #2 and #3 in the list above are zero-width ones.
In other words, they do not make the string "wider".
In yet other words, ซื้อ ("to buy") and ซอ ("fiddle") would have equal width of two character places (but string lengths of 4 and 2, correspondingly).
Solution
In order to calculate the "real" string length, one must skip zero-width characters.
Python-specific
The unicodedata module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 8.0.0.
The unicodedata.category(unichr) method returns one of the following General Category Values:
"Lo" for a normal character;
"Mn" for zero-width non-spacing marks.
The rest is obvious: simply filter out the latter ones.
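For example, a quick check on the word from above:
import unicodedata

word = u'ซื้อ'
width = sum(1 for ch in word if unicodedata.category(ch) != 'Mn')
# width == 2, while len(word) == 4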
Further info:
Unicode data for Thai script (scroll till the first occurrence of "THAI CHARACTER")
I think what you mean to ask is how to determine the 'true' number of characters in เรือ, ไป, ซื้อ, etc. (which are 3, 2, and 2, respectively).
Unfortunately, here's how Python interprets these characters:
ไป
>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2
ซื้อ
>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4
เรือ
>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'
>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4
There's no real correlation between the number of characters displayed and the number of actual (from Python's perspective) characters that make up the string.
I can't think of an obvious way to do this. However, I've found this library which might be of help to you. (You will also need to install some prerequisites.)
It looks like the rjust() function will not work for you, and you will need to count the number of cells in the string yourself. You can then insert the number of spaces required before the string to achieve justification.
You seem to know the Thai language. Sum the number of consonants, preceding vowels, following vowels, and Thai punctuation. Don't count diacritics or above and below vowels.
Something like (forgive my pseudo-Python):
cells = 0
for ch in string:
    if ch == u'\u0e31' or u'\u0e34' <= ch <= u'\u0e3a' or u'\u0e47' <= ch <= u'\u0e4e':
        # above/below vowel, tone mark or diacritic: takes no cell
        pass
    else:
        # consonant, preceding or following vowel, or punctuation
        cells += 1
Here's a function to compute the display length of a Thai string (the number of characters arranged horizontally), based on bytebuster's answer:
import unicodedata

def get_thai_string_length(string):
    length = 0
    for c in string:
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length

print(len('บอินทัช'))  # 7
print(get_thai_string_length('บอินทัช'))  # 5
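Building on that, one way to emulate rjust() is to pad by display width instead of len() (a sketch):
def rjust_thai(s, width, fillchar=' '):
    # Pad based on the number of display cells rather than len(s).
    pad = max(0, width - get_thai_string_length(s))
    return fillchar * pad + s

print(rjust_thai('ไป', 5))
print(rjust_thai('ซื้อ', 5))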

How do I count the number of sentences, words and characters in a file?

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words and characters in the file? I have used NLTK in python for this.
>>>import nltk.data
>>>import nltk.tokenize
>>>f=open('samp.txt')
>>>raw=f.read()
>>>tokenized_sentences=nltk.sent_tokenize(raw)
>>>for each_sentence in tokenized_sentences:
... words=nltk.tokenize.word_tokenize(each_sentence)
... print each_sentence #prints tokenized sentences from samp.txt
>>>tokenized_words=nltk.word_tokenize(raw)
>>>for each_word in tokenized_words:
... words=nltk.tokenize.word_tokenize(each_word)
... print each_word #prints tokenized words from samp.txt
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')
print "The number of sentences =", len(corpusReader.sents())
print "The number of paragraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])
Hope this helps
With nltk, you can also use FreqDist (see the O'Reilly book, ch. 3.1).
And in your case:
import nltk
raw = open('samp.txt').read()
raw = nltk.Text(nltk.word_tokenize(raw.decode('utf-8')))
fdist = nltk.FreqDist(raw)
print fdist.N()
For what it's worth, if someone comes along here: I think this addresses everything the OP asked. With the textstat package, counting sentences and characters is very easy, and punctuation at the end of each sentence matters.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
I believe this to be the right solution, because it properly counts things like "..." and "??" as a single sentence:
len(re.findall(r"[^?!.][?!.]", paragraph))
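For example:
import re

paragraph = "Wait... Really?? Yes."
print(len(re.findall(r"[^?!.][?!.]", paragraph)))  # 3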
Characters are easy to count.
Paragraphs are usually easy to count too. Whenever you see two consecutive newlines, you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one or no newlines following.
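A rough sketch of that heuristic (splitting on runs of blank lines):
import re

def count_paragraphs(text):
    # Split on runs of blank lines and ignore empty chunks at the edges.
    return len([p for p in re.split(r'\n\s*\n', text.strip()) if p.strip()])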
Sentences are tricky. You might settle for a period, exclamation mark, or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks an end of sentence and sometimes it doesn't. Usually, when it does, the next non-whitespace character is capitalized, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks end of sentence (but that is arguable, as in this case).
Words too are tricky. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
Not 100% correct, but I just gave it a try. I have not taken all the points by @wilhelmtell into consideration. I'll try them once I have time...
if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum([len(word) for word in words])
            prevIsSentence = True
        else:
            if prevIsSentence:
                s = s + 1
                prevIsSentence = False
    if not prevIsSentence:
        s = s - 1
    print "%d:%d:%d" % (c, w, s)
Here 1.txt is the file name.
The only way you can solve this is by creating an AI program that uses Natural Language Processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."
Checkout OpenNLP
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters-- wc.
