I need to define a function that finds every index where a certain character occurs in a string, sums those indices, divides by the number of times the character occurs in the string, and then divides all that by the length of the text.
Here's what I have so far:
def ave_index(char):
    passage = "string"
    if char in passage:
        word = passage.split(char)
        words = len(word)
        number = passage.count(char)
        answer = word / number / len(passage)
        return(answer)
    elif char not in passage:
        return False
So far, the answers I've gotten when running this have been quite off the mark
EDIT: The passage we were given to use as a string -
'Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'
when char = 's' the answer should be 0.5809489252885479
You can use Counter to check frequencies:
from collections import Counter
words = 'Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people\'s hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'
freqs = Counter(list(words)) # list(words) returns a list of all the characters in words, then Counter will calculate the frequencies
print(float(freqs['s']) / len(words))
The problem is how you are counting the letters. Take the string s = 'hello world' and say you are trying to count how many l's there are. We know there are 3, but if you do a split:
>>> s.split('l')
['he', '', 'o wor', 'd']
This will result in a count of 4, because splitting produces one more piece than there are separators. Further, we have to get the position of each occurrence of the character in the string.
The enumerate built-in helps us out here:
>>> s = 'hello world'
>>> c = 'l' # The letter we are looking for
>>> results = [k for k,v in enumerate(s) if v == c]
>>> results
[2, 3, 9]
Now we have the total number of occurrences len(results), and the positions in the string where the letter occurs.
The final "trick" to this problem is to make sure you divide by a float, in order to get the proper result.
Working against your sample text (stored in s):
>>> c = 's'
>>> results = [k for k,v in enumerate(s) if v == c]
>>> results_sum = sum(results)
>>> (results_sum / len(results)) / float(len(s))
0.5804132973944295
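Putting those pieces together, here is a sketch of what a corrected version of your function could look like. (I've made the passage a parameter rather than hard-coding it, so the passage argument is just an assumption about how you want to call it.)

def ave_index(char, passage):
    # Positions of every occurrence of char in the passage
    positions = [k for k, v in enumerate(passage) if v == char]
    if not positions:
        return False
    # Average index, divided again by the length of the passage
    return (sum(positions) / float(len(positions))) / len(passage)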
string1 = "The wind, "
string2 = "which had hitherto carried us along with amazing rapidity, "
string3 = "sank at sunset to a light breeze; "
string4 = "the soft air just ruffled the water and "
string5 = "caused a pleasant motion among the trees as we approached the shore, "
string6 = "from which it wafted the most delightful scent of flowers and hay."
I tried:
for i in range(6):
    message += string(i)
but it didn't work and raised an error: name 'string' is not defined
I want to manipulate the variables directly. I know that putting them in a list is much easier, but imagine if you had 1000 strings; it would be difficult to write each one out in a list.
Using join():
cList = [string1, string2, string3, string4, string5, string6]
print("".join(cList))
What I'd suggest, instead of n separate variables, is to have them in a list:
x = ["The wind, ", "which had hitherto carried us along with amazing rapidity, ", "sank at sunset to a light breeze; ", "the soft air just ruffled the water and ", "caused a pleasant motion among the trees as we approached the shore", "from which it wafted the most delightful scent of flowers and hay."]
print("".join(x))
One-liner:
print("".join([string1, string2, string3, string4, string5, string6]))
If you can put the strings in a list in the first place, so much the better. Otherwise:
message = ""
for s in [string1, string2, string3, string4, string5, string6]:
    message += s
You are using separate variables, so you would have to reference each one individually to concatenate them; you are trying to access them as though they were items in a list. Try adding them to an array or list instead.
Maybe you were looking for eval?
message = ''.join([eval('string'+str(i)) for i in range(1,7)])
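If you would rather avoid eval, looking the names up in globals() does the same job. This is just a sketch, assuming string1 through string6 are module-level variables:

# globals() maps each module-level name to its value, so we can look the
# variables up by their string names
message = ''.join(globals()['string' + str(i)] for i in range(1, 7))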
There are various solutions to this problem; one was given in the comments. The reason you are getting that error is that string doesn't exist: you are calling string(i), and what exactly are you expecting that to do?
Also, the loop you wrote doesn't have the right logic. When coding and not getting the expected result, the first line of defence is to debug. In this case, understand what you are looping through, which is just numbers; go ahead and print that i variable so you can see what's happening. You aren't accessing your stringX variables at all. They need to be contained in an iterable for you to iterate over them. Not to mention that the loop bounds are wrong, since range(x) produces numbers from 0 to x-1, which in your case would be 0 1 2 3 4 5 rather than 1 through 6. You would have known that if you had debugged. Part of coding, I must say, is debugging; it's a good habit to get into. A minimal fix along those lines is sketched below.
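For example, a sketch of the corrected approach using the six variables from the question, put into a list so they can be indexed:

strings = [string1, string2, string3, string4, string5, string6]
message = ""
for i in range(6):      # i runs from 0 to 5, so it indexes the list correctly
    message += strings[i]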
Here's what the Python documentation says about joining strings:
string.join(words[, sep])
Concatenate a list or tuple of words with intervening occurrences of sep. The default value for sep is a single space character. It is always true that string.join(string.split(s, sep), sep) equals s.
That means you can use the string method join to concatenate strings. The method requires that you pass it a list of strings. Your code would look like this:
string1 = "The wind, "
string2 = "which had hitherto carried us along with amazing rapidity, "
string3 = "sank at sunset to a light breeze; "
string4 = "the soft air just ruffled the water and "
string5 = "caused a pleasant motion among the trees as we approached the shore, "
string6 = "from which it wafted the most delightful scent of flowers and hay."
message = "".join([string1, string2, string3, string4, string5, string6])
print(message)
Output:
The wind, which had hitherto carried us along with amazing rapidity, sank at sunset to
a light breeze; the soft air just ruffled the water and caused a pleasant motion among the
trees as we approached the shore, from which it wafted the most delightful scent of
flowers and hay.
Since I am not sure what your goal is, I am going to assume, for the sake of having something different, that you have an arbitrary number of string variables passed to you. That's even easier to handle, because if you define a function called join_strings with a *strings parameter, the values passed in are already collected into a tuple for you. Neat, right? So your code would be something like:
def join_strings(*strings):
    return "".join(strings)
Incredibly short and sweet, isn't it? Then you would call that function like this:
join_strings(string1, string2, string3, string4, string5, string6)
The cool thing is that tomorrow you might have only 3 strings, or 8, and that still works. Of course, it'd be more helpful to know why you are even saving the strings like that, since I'm sure you can use a better suited data structure for your needs (like using a list to begin with).
Next time you post to Stack Overflow, it's good to show what you have tried and how you went about understanding your problem, instead of just pasting the problem.
Just write your story in one go (note the lack of commas between the strings):
message = ("The wind, "
"which had hitherto carried us along with amazing rapidity, "
"sank at sunset to a light breeze; "
"the soft air just ruffled the water and "
"caused a pleasant motion among the trees as we approached the shore, "
"from which it wafted the most delightful scent of flowers and hay.")
Or if you want to type fewer quotes:
import inspect
message = """
The wind,
which had hitherto carried us along with amazing rapidity,
sank at sunset to a light breeze;
the soft air just ruffled the water and
caused a pleasant motion among the trees as we approached the shore,
from which it wafted the most delightful scent of flowers and hay.
"""
message = inspect.cleandoc(message) # remove unwanted indentation
message = message.replace('\n', ' ') # remove the newlines
This is a homework project from last week. I had problems, so I did not turn it in, but I like to go back and see if I can make these things work. Now that I have it printing the right words in alphabetical order, I have the problem that it prints 3 separate lists of unique words, each with a different number of words in it. How can I fix this?
import string

def process_line(line_str, word_set):
    line_str = line_str.strip()
    list_of_words = line_str.split()
    for word in list_of_words:
        if word != "--":
            word = word.strip()
            word = word.strip(string.punctuation)
            word = word.lower()
            word_set.add(word)

def pretty_print(word_set):
    list_of_words = []
    for w in word_set:
        list_of_words.append(w)
    list_of_words.sort()
    for w in list_of_words:
        print(w, end=" ")

word_set = set([])
fObject = open("gettysburg.txt")
for line_str in fObject:
    process_line(line_str, word_set)
    print("\nlength of the word set: ", len(word_set))
    print("\nUnique words in set: ")
    pretty_print(word_set)
Below is the output I get; I only want it to give me the last list, the one with 138 words. I'd appreciate any help.
length of the word set: 29
Unique words in set:
a ago all and are brought conceived continent created dedicated equal fathers forth four in liberty men nation new on our proposition score seven that the this to years
length of the word set: 71
Unique words in set:
a ago all altogether and any are as battlefield brought can civil come conceived continent created dedicate dedicated do endure engaged equal fathers field final fitting for forth four gave great have here in is it liberty live lives long men met might nation new now of on or our place portion proper proposition resting score seven should so testing that the their this those to war we whether who years
length of the word set: 138
Unique words in set:
a above add advanced ago all altogether and any are as battlefield be before birth brave brought but by can cause civil come conceived consecrate consecrated continent created dead dedicate dedicated detract devotion did died do earth endure engaged equal far fathers field final fitting for forget forth fought four freedom from full gave god government great ground hallow have here highly honored in increased is it larger last liberty little live lives living long measure men met might nation never new nobly nor not note now of on or our people perish place poor portion power proper proposition rather remaining remember resolve resting say score sense seven shall should so struggled take task testing that the their these they this those thus to under unfinished us vain war we what whether which who will work world years
Take the last 3 lines out of the for loop:
....
for line_str in fObject:
    process_line(line_str, word_set)

print("\nlength of the word set: ", len(word_set))
print("\nUnique words in set: ")
pretty_print(word_set)
This should be very simple and short, but I cannot think of a good, short way of doing it:
I have a string for instance:
'How many roads must a man walk down Before you call him a man? How
many seas must a white dove sail Before she sleeps in the sand? Yes,
and how many times must the cannon balls fly Before they're forever
banned?'
and I want to substitute a word, say "how", with a running number, so I get:
'[1] many roads must a man walk down Before you call him a man? [2]
many seas must a white dove sail Before she sleeps in the sand? Yes,
and [3] many times must the cannon balls fly Before they're forever
banned?'
You can utilise itertools.count and a function as the replacement argument, eg:
import re
from itertools import count
text = '''How many roads must a man walk down Before you call him a man? How many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannon balls fly Before they're forever banned?'''
result = re.sub(r'(?i)\bhow\b', lambda m, c=count(1): '[{}]'.format(next(c)), text)
# [1] many roads must a man walk down Before you call him a man? [2] many seas must a white dove sail Before she sleeps in the sand? Yes, and [3] many times must the cannon balls fly Before they're forever banned?
You can use re.sub with a replacement function. The function looks up in a dictionary how often that word has been seen so far and returns the corresponding number.
import collections
import re

counts = collections.defaultdict(int)

def subst_count(match):
    word = match.group().lower()
    counts[word] += 1
    return "[%d]" % counts[word]
Example:
>>> text = "How many ...? How many ...? Yes, and how many ...?"
>>> re.sub(r"\bhow\b", subst_count, text, flags=re.I)
'[1] many ...? [2] many ...? Yes, and [3] many ...?'
Note: This keeps a separate count for each word that gets replaced (in case you use a regex that matches more than one word), but it will not reset the counts between calls to re.sub.
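If you do need the counts to reset on every substitution, one option (just a sketch; number_matches is an illustrative name, not a library function) is to build a fresh dictionary inside a small wrapper:

import collections
import re

def number_matches(pattern, text):
    # A fresh counter per call, so the numbering restarts at [1] each time
    counts = collections.defaultdict(int)
    def subst_count(match):
        word = match.group().lower()
        counts[word] += 1
        return "[%d]" % counts[word]
    return re.sub(pattern, subst_count, text, flags=re.I)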
Here's another way to do it with re.sub with a replacement function. But rather than using a global object to keep track of the count this code uses a function attribute.
import re

def count_replace():
    def replace(m):
        replace.count += 1
        return '[%d]' % replace.count
    replace.count = 0
    return replace
src = '''How many roads must a man walk down Before you call him a man? How many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannon balls fly Before they're forever banned?'''
pat = re.compile('how', re.I)
print(pat.sub(count_replace(), src))
output
[1] many roads must a man walk down Before you call him a man? [2]
many seas must a white dove sail Before she sleeps in the sand? Yes,
and [3] many times must the cannon balls fly Before they're forever
banned?
If you need to only replace complete words and not partial words, then you'll need a smarter regex, eg r"\bhow\b".
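For example, reusing the count_replace factory from above:

pat = re.compile(r'\bhow\b', re.I)   # \b anchors the match to word boundaries
print(pat.sub(count_replace(), src))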
Test = 'How many roads must a man walk down Before you call him a man? How many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannon balls fly Before theyre forever banned?'
i = 0
while "How" in Test:
    new = "[" + str(i) + "]"
    Test = Test.replace("How", new, i)
    i = i + 1
print(Test)
Output
[1] many roads must a man walk down Before you call him a man? [2] many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannon balls fly Before theyre forever banned?
Just for fun, I wanted to see if I could solve this using recursion, and this is what I got:
def count_replace(s, to_replace, leng=0, count=1, replaced=[]):
    if s.find(' ') == -1:
        replaced.append(s)
        return ' '.join(replaced)
    else:
        if s[0:s.find(' ')].lower() == to_replace.lower():
            replaced.append('[%d]' % count)
            count += 1
            leng = len(to_replace)
        else:
            replaced.append(s[0:s.find(' ')])
            leng = s.find(' ')
        return count_replace(s[leng + 1:], to_replace, leng, count, replaced)
It goes without saying that I wouldn't recommend this as it's ridiculously inefficient on top of the fact that it's also overly complicated, but I thought I'd share it anyway!
I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it.
I start by reducing all inaugurals to lists of stopword-free sentences. I then build a frequency index.
Next, I compare each sentence in Obama's 2013 address to each sentence of every other address, and evaluate the similarity like so:
# Compare two lemmatized sentences. Assumes stop words have already been removed.
# frequencies is a dict of word frequencies across all inaugurals.
def compare(sentA, sentB, frequencies):
    intersect = [x for x in sentA if x in sentB]
    N = [frequencies[x] for x in intersect]
    # Calculate a sum that weights uncommon words based on their frequency across the inaugurals
    n = sum([10.0 / (x + 1) for x in N])
    # Ratio of matches to total words in both sentences. (John Adams and William Harrison
    # both favored loooooong sentences that tend to produce matches by sheer probability.)
    c = float(len(intersect)) / (len(sentA) + len(sentB))
    return (intersect, N, n, c)
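To make the inputs concrete, here is a tiny usage sketch with invented data (each sentence is a list of lemmatized, stopword-free tokens, and frequencies maps each word to its count across all the inaugurals):

sentA = ['history', 'truth', 'freedom', 'god', 'earth']
sentB = ['history', 'judge', 'god', 'earth', 'work']
frequencies = {'history': 12, 'truth': 5, 'freedom': 20, 'god': 30,
               'earth': 8, 'judge': 3, 'work': 15}
intersect, N, n, c = compare(sentA, sentB, frequencies)
# intersect is ['history', 'god', 'earth'] and c is 3 / 10 = 0.3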
Last, I filter out results based on arbitrary cutoffs for n and c.
It works better than one might think, identifying sentences that share uncommon words in a non-negligible proportion to total words.
For example, it picked up these matches:
Obama, 2013:
For history tells us that while these truths may be self-evident, they have never been self-executing; that while freedom is a gift from God, it must be secured by His people here on Earth.
Kennedy, 1961:
With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on earth God's work must truly be our own.
Obama, 2013
Through blood drawn by lash and blood drawn by sword, we learned that no union founded on the principles of liberty and equality could survive half-slave and half-free.
Lincoln, 1861
Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said "the judgments of the Lord are true and righteous altogether.
Obama, 2013
This generation of Americans has been tested by crises that steeled our resolve and proved our resilience
Kennedy, 1961
Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty.
But it's very crude.
I don't have the chops for a major machine-learning project, but I do want to apply more theory if possible. I understand bigram searching, but I'm not sure that will work here -- it's not so much exact bigrams we're interested in as general proximity of two words that are shared between quotes. Is there a fuzzy sentence comparison that looks at probability and distribution of words without being too rigid? The nature of allusion is that it's very approximate.
Current effort available on Cloud9IDE
UPDATE, 1/24/13
Per the accepted answer, here's a simple Python function for bigram windows:
def bigrams(tokens, blur=1):
    grams = []
    for c in range(len(tokens) - 1):
        for i in range(c + 1, min(c + blur + 1, len(tokens))):
            grams.append((tokens[c], tokens[i]))
    return grams
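For example, with a blur of 2 each token is paired with the next two tokens:

>>> bigrams("find similar sentences in past inaugurals".split(), blur=2)
[('find', 'similar'), ('find', 'sentences'), ('similar', 'sentences'),
 ('similar', 'in'), ('sentences', 'in'), ('sentences', 'past'),
 ('in', 'past'), ('in', 'inaugurals'), ('past', 'inaugurals')]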
If you are inspired to use bigrams, you could build your bigrams while allowing gaps of one, two, or even three words so as to loosen up the definition of bigram a little bit. This could work since allowing n gaps means fewer than n times as many "bigrams", and your corpus is pretty small. With this, for example, a "bigram" from your first paragraph could be (similar, inaugurals).
There are masses of text information in people's comments which are parsed from html, but there are no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Apparently, there are 'thumb', 'green', 'apple', etc. in the string. I also have a large dictionary to query whether the word is reasonable.
So, what's the fastest way to extract these words?
I'm not really sure a naive algorithm would serve your purpose well, as pointed out by eumiro, so I'll describe a slightly more complex one.
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
The code
import math

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, math.log((i+1) * math.log(len(words)))) for i, k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
Examples
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.
"Apparently" is good for humans, not for computers…
words = set(possible words)
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
for i in xrange(len(s) - 1):
    for j in xrange(1, len(s) - i):
        if s[i:i+j] in words:
            print s[i:i+j]
With possible words taken from /usr/share/dict/words and with for j in xrange(3, len(s) - i): (a minimum word length of 3), it finds:
thumb
hum
green
nap
apple
plea
lea
act
active
ass
assign
assignment
sign
men
twee
wee
week
weekly
met
eta
tap