def most_frequency_occ(chars,inputString):
count = 0
for ind_char in inputString:
ind_char = ind_char.lower()
if chars == ind_char:
count += 1
return count
def general(inputString):
maxOccurences = 0
for chars in inputString:
most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequent occurring letter in general. I created another function called most_frequency_occ that finds a specific character in the string that occurs the most often, but how do I generalize it into finding the frequent letter in a string without specifying a specific character and only using loops, without any build in string functions either.
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.
If I got your task, I think that using a dictionary will be more comfortable for you.
# initializing string
str = "Hello world"
# initializing dict of freq
freq = {}
for i in str:
if i in freq:
freq[i] += 1
else:
freq[i] = 1
# Now, you have the count of every char in this string.
# If you want to extract the max, this step will do it for you:
max_freq_chr = max(stats.values())
There are multiple ways you find the most common letter in a string.
One easy to understand and cross-language way of doing this would be:
initialize an array of 26 integers set to 0.
go over each letter one by one of your string, if the first letter is an B (B=2), you can increment the second value of the array
Find the largest value in your array, return the corresponding letter.
Since you are using python, you could use dictionaries since it would be less work to implement.
A word of caution, it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy paste code from the internet.
The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can do a comparsion between the count and the previously best count to see if you've just checked the most common letter so far:
maxOccurences = 0
for chars in inputString:
count = most_frequency_occ(chars, inputString)
if count > maxOccurences: # check if chars is more common than the previous best
maxOccurences = count
return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.
Related
I have a simple function which when given an input like (x,y), it will return {{x},{x,y}}.
In the cases that x=y, it naturally returns {{x},{x,x}}.
I can't figure out how to get Regex to substitute 'x' in place of 'x,x'. But even if I could figure out how to do this, the expression would go from {{x},{x,x}} to {{x},{x}}, which itself would need to be substituted for {{x}}.
The closest I have gotten has been:
re.sub('([0-9]+),([0-9]+)',r'\1',string)
But this function will also turn {{x},{x,y}} into {{x},{x}}, which is not desired. Also you may notice that the function searches for numbers only, which is fine because I really only intend to be using numbers in the place of x and y; however, if there is a way to get it to work with any letter as well (lower case or capital) the would be even more ideal.
Note also that if I give my original function (x,y,z) it will read it as ((x,y),z) and thus return {{{{x},{x,y}}},{{{x},{x,y}},z}}, thus in the case that x=y=z, I would want to be able to have a Regex function call itself repeatedly to reduce this to {{{{x}}},{{{x}},x}} instead of {{{{x},{x,x}}},{{{x},{x,x}},x}}.
If it helps at all, this is essentially an attempt at making a translation (into sets) using the Kuratowski definition of an ordered pair.
Essentially to solve this you need recursion, or more simply, keep applying the regex in a loop until the replacement doesn't change the input string. For example using your regex from https://regex101.com/r/Yl1IJv/4:
s = '{{ab},{ab,ab}}'
while True:
news = re.sub(r'(?P<first>.?(\w+|\d+).?),(?P=first)', r'\g<1>', s, 0)
if news == s:
break
s = news
print(s)
Output
{{ab}}
Demo on rextester
With
s = '{{{{x},{x,x}}},{{{x},{x,x}},x}}'
The output is
{{{{x}}},{{{x}},x}}
as required. Demo on rextester
I'm relatively new to Python, and I keep seeing examples like:
def max_wordnum(texts):
count = 0
for text in texts:
if len(text.split()) > count:
count = len(text.split())
return count
Is the repeated len(text.split()) somehow optimized away by the interpreter/compiler in Python, or will this just take twice the CPU cycles of storing len(text.split()) in a variable?
Duplicate expressions are not "somehow optimized away". Use a local variable to capture and re-use a result that is 'known not to change' and 'takes some not-insignificant time' to create; or where using a variable increases clarity.
In this case, it's impossible for Python to know that 'text.split()' is pure - a pure function is one with no side-effects and always returns the same value for the given input.
Trivially: Python, being a dynamically-typed language, doesn't even know the type of 'text' before it actually gets a value, so generalized optimization of this kind is not possible. (Some classes may provide their own internal 'cache optimizations', but digressing..)
As: even a language like C#, with static typing, won't/can't optimize away general method calls - as, again, there is no basic enforceable guarantee of purity in C#. (ie. What if the method returned a different value on the second call or wrote to the console?)
But: a Haskell, a Purely Functional language, has the option to not 'evaluate' the call twice, being a different language with different rules...
Even if python did optimize this (which isn't the case), the code is copy/paste all over and more difficult to maintain, so creating a variable to hold the result of a complex computation is always a good idea.
A better idea yet is to use max with a key function in this case:
return max(len(text.split()) for text in texts)
this is also faster.
Also note that len(text.split()) creates a list and you just count the items. A better way would be to count the spaces (if words are separated by only one space) by doing
return max(text.count(" ") for text in texts) + 1
if there can be more than 1 space, use regex and finditer to avoid creating lists:
return max(sum(1 for _ in re.finditer("\s+",text)) for text in texts) + 1
note the 1 value added in the end to correct the value (number of separators is one less than the number of words)
As an aside, even if the value isn't cached, you still can use complex expressions in loops with range:
for i in range(len(text.split())):
the range object is created at the start, and the expression is only evaluated once (as opposed as C loops for instance)
I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar thus there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs = DIGITS):
"""return the next index of a char not in subs"""
for i, e in enumerate(self.toklist[atindex:]):
if e not in subs:
return i - len(self.toklist)
else:
break
return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx mostly because I don't know it, and using it for this right now would make my code uncomprehendable to me, its creator.
There must be a simple, < O(n) solution to this, that simply collects the contiguous numerals in a string given a starting point, up until a non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?
My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233
I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.
First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:
from urllib2 import urlopen
class Word(object):
def __init__(self):
self.word_count = {}
def alpha_only(self, word):
"""Converts word to lowercase and removes any non-alphabetic characters."""
x = ''
for letter in word:
s = letter.lower()
if s in 'abcdefghijklmnopqrstuvwxyz':
x += s
if len(x) > 0:
return x
def count(self, line):
"""Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
Adds each word to word_count if not already present, if present increases the count by 1."""
words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
for word in words:
if word in self.word_count:
self.word_count[word] += 1
elif word != None:
self.word_count[word] = 1
class File(object):
def __init__(self,book):
self.book = urlopen(book)
self.word = Word()
def strip_line(self,line):
"""Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
splits up the line and passes it along to the count method of the word object."""
s = line.strip('\n\r\t')
if s > 1:
self.word.count(s)
def process_book(self):
"""Main processing loop, will not begin processing until the first line after the line containing "START".
After processing it will close the file."""
begin = False
for line in self.book:
if begin == True:
self.strip_line(line)
elif 'START' in line:
begin = True
self.book.close()
book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')
book.process_book()
count = book.word.word_count
So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count or it gives me the exception ValueError: too many values to unpack, which rules out using list comprehension or mapping to a function to perform any kind of sorting.
I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:
s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)
# Delete the original dict, it is no longer needed
del book.word.word_count
Now I finally have a sorted list of words, s_count. PHEW! So, my questions are:
Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?
This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade off be that the program becomes more computationally expensive(higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?
Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.
Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack? I expect it to give ValueError: need more than 1 value to unpack.
When you use a dict as an iterator (eg in a for loop) you just get the keys, you don't get the values. If you want key, value pairs you need to use the dict's iteritems() method as mentioned by figs in the comment (or in Python 3 the items() method).
Of course, you can always do something like:
for k in count:
print k, count[k]
...
I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)
It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.
The list comprehension in your count() method calls alpha_only() twice for each word, which is in efficient.
You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.
...
You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.
...
As for the duplicated call of alpha_only(), whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,
words = []
for word in line.split():
word = self.alpha_only(word)
if word is not None:
words.append(word)
I'm tackling project euler's problem 220 (looked easy, in comparison to some of the
others - thought I'd try a higher numbered one for a change!)
So far I have:
D = "Fa"
def iterate(D,num):
for i in range (0,num):
D = D.replace("a","A")
D = D.replace("b","B")
D = D.replace("A","aRbFR")
D = D.replace("B","LFaLb")
return D
instructions = iterate("Fa",50)
print instructions
Now, this works fine for low values, but when you put it to repeat higher then you just get a "Memory error". Can anyone suggest a way to overcome this? I really want a string/file that contains instructions for the next step.
The trick is in noticing which patterns emerge as you run the string through each iteration. Try evaluating iterate(D,n) for n between 1 and 10 and see if you can spot them. Also feed the string through a function that calculates the end position and the number of steps, and look for patterns there too.
You can then use this knowledge to simplify the algorithm to something that doesn't use these strings at all.
Python strings are not going to be the answer to this one. Strings are stored as immutable arrays, so each one of those replacements creates an entirely new string in memory. Not to mention, the set of instructions after 10^12 steps will be at least 1TB in size if you store them as characters (and that's with some minor compressions).
Ideally, there should be a way to mathematically (hint, there is) generate the answer on the fly, so that you never need to store the sequence.
Just use the string as a guide to determine a method which creates your path.
If you think about how many "a" and "b" characters there are in D(0), D(1), etc, you'll see that the string gets very long very quickly. Calculate how many characters there are in D(50), and then maybe think again about where you would store that much data. I make it 4.5*10^15 characters, which is 4500 TB at one byte per char.
Come to think of it, you don't have to calculate - the problem tells you there are 10^12 steps at least, which is a terabyte of data at one byte per character, or quarter of that if you use tricks to get down to 2 bits per character. I think this would cause problems with the one-minute time limit on any kind of storage medium I have access to :-)
Since you can't materialize the string, you must generate it. If you yield the individual characters instead of returning the whole string, you might get it to work.
def repl220( string ):
for c in string:
if c == 'a': yield "aRbFR"
elif c == 'b': yield "LFaLb"
else yield c
Something like that will do replacement without creating a new string.
Now, of course, you need to call it recursively, and to the appropriate depth. So, each yield isn't just a yield, it's something a bit more complex.
Trying not to solve this for you, so I'll leave it at that.
Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.
You could treat D as a byte stream file.
Something like:-
seedfile = open('D1.txt', 'w');
seedfile.write("Fa");
seedfile.close();
n = 0
while (n
warning totally untested