How to work with very long strings in Python?

How to work with very long strings in Python? - python

I'm tackling project euler's problem 220 (looked easy, in comparison to some of the
others - thought I'd try a higher numbered one for a change!)
So far I have:
D = "Fa"
def iterate(D,num):
for i in range (0,num):
D = D.replace("a","A")
D = D.replace("b","B")
D = D.replace("A","aRbFR")
D = D.replace("B","LFaLb")
return D
instructions = iterate("Fa",50)
print instructions
Now, this works fine for low values, but when you put it to repeat higher then you just get a "Memory error". Can anyone suggest a way to overcome this? I really want a string/file that contains instructions for the next step.

The trick is in noticing which patterns emerge as you run the string through each iteration. Try evaluating iterate(D,n) for n between 1 and 10 and see if you can spot them. Also feed the string through a function that calculates the end position and the number of steps, and look for patterns there too.
You can then use this knowledge to simplify the algorithm to something that doesn't use these strings at all.

Python strings are not going to be the answer to this one. Strings are stored as immutable arrays, so each one of those replacements creates an entirely new string in memory. Not to mention, the set of instructions after 10^12 steps will be at least 1TB in size if you store them as characters (and that's with some minor compressions).
Ideally, there should be a way to mathematically (hint, there is) generate the answer on the fly, so that you never need to store the sequence.
Just use the string as a guide to determine a method which creates your path.

If you think about how many "a" and "b" characters there are in D(0), D(1), etc, you'll see that the string gets very long very quickly. Calculate how many characters there are in D(50), and then maybe think again about where you would store that much data. I make it 4.5*10^15 characters, which is 4500 TB at one byte per char.
Come to think of it, you don't have to calculate - the problem tells you there are 10^12 steps at least, which is a terabyte of data at one byte per character, or quarter of that if you use tricks to get down to 2 bits per character. I think this would cause problems with the one-minute time limit on any kind of storage medium I have access to :-)

Since you can't materialize the string, you must generate it. If you yield the individual characters instead of returning the whole string, you might get it to work.
def repl220( string ):
for c in string:
if c == 'a': yield "aRbFR"
elif c == 'b': yield "LFaLb"
else yield c
Something like that will do replacement without creating a new string.
Now, of course, you need to call it recursively, and to the appropriate depth. So, each yield isn't just a yield, it's something a bit more complex.
Trying not to solve this for you, so I'll leave it at that.

Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.

You could treat D as a byte stream file.
Something like:-
seedfile = open('D1.txt', 'w');
seedfile.write("Fa");
seedfile.close();
n = 0
while (n
warning totally untested

Related

How to call another function's results

def most_frequency_occ(chars,inputString):
count = 0
for ind_char in inputString:
ind_char = ind_char.lower()
if chars == ind_char:
count += 1
return count
def general(inputString):
maxOccurences = 0
for chars in inputString:
most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequent occurring letter in general. I created another function called most_frequency_occ that finds a specific character in the string that occurs the most often, but how do I generalize it into finding the frequent letter in a string without specifying a specific character and only using loops, without any build in string functions either.
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.

If I got your task, I think that using a dictionary will be more comfortable for you.
# initializing string
str = "Hello world"
# initializing dict of freq
freq = {}
for i in str:
if i in freq:
freq[i] += 1
else:
freq[i] = 1
# Now, you have the count of every char in this string.
# If you want to extract the max, this step will do it for you:
max_freq_chr = max(stats.values())

There are multiple ways you find the most common letter in a string.
One easy to understand and cross-language way of doing this would be:
initialize an array of 26 integers set to 0.
go over each letter one by one of your string, if the first letter is an B (B=2), you can increment the second value of the array
Find the largest value in your array, return the corresponding letter.
Since you are using python, you could use dictionaries since it would be less work to implement.
A word of caution, it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy paste code from the internet.

The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can do a comparsion between the count and the previously best count to see if you've just checked the most common letter so far:
maxOccurences = 0
for chars in inputString:
count = most_frequency_occ(chars, inputString)
if count > maxOccurences: # check if chars is more common than the previous best
maxOccurences = count
return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.

How can I get Regex to remove redundancies and call itself again?

I have a simple function which when given an input like (x,y), it will return {{x},{x,y}}.
In the cases that x=y, it naturally returns {{x},{x,x}}.
I can't figure out how to get Regex to substitute 'x' in place of 'x,x'. But even if I could figure out how to do this, the expression would go from {{x},{x,x}} to {{x},{x}}, which itself would need to be substituted for {{x}}.
The closest I have gotten has been:
re.sub('([0-9]+),([0-9]+)',r'\1',string)
But this function will also turn {{x},{x,y}} into {{x},{x}}, which is not desired. Also you may notice that the function searches for numbers only, which is fine because I really only intend to be using numbers in the place of x and y; however, if there is a way to get it to work with any letter as well (lower case or capital) the would be even more ideal.
Note also that if I give my original function (x,y,z) it will read it as ((x,y),z) and thus return {{{{x},{x,y}}},{{{x},{x,y}},z}}, thus in the case that x=y=z, I would want to be able to have a Regex function call itself repeatedly to reduce this to {{{{x}}},{{{x}},x}} instead of {{{{x},{x,x}}},{{{x},{x,x}},x}}.
If it helps at all, this is essentially an attempt at making a translation (into sets) using the Kuratowski definition of an ordered pair.

Essentially to solve this you need recursion, or more simply, keep applying the regex in a loop until the replacement doesn't change the input string. For example using your regex from https://regex101.com/r/Yl1IJv/4:
s = '{{ab},{ab,ab}}'
while True:
news = re.sub(r'(?P<first>.?(\w+|\d+).?),(?P=first)', r'\g<1>', s, 0)
if news == s:
break
s = news
print(s)
Output
{{ab}}
Demo on rextester
With
s = '{{{{x},{x,x}}},{{{x},{x,x}},x}}'
The output is
{{{{x}}},{{{x}},x}}
as required. Demo on rextester

Finding the end of a contiguous substring of a string without iteration or RegEx

I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar thus there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs = DIGITS):
"""return the next index of a char not in subs"""
for i, e in enumerate(self.toklist[atindex:]):
if e not in subs:
return i - len(self.toklist)
else:
break
return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx mostly because I don't know it, and using it for this right now would make my code uncomprehendable to me, its creator.
There must be a simple, < O(n) solution to this, that simply collects the contiguous numerals in a string given a starting point, up until a non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?

My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.

The first thing that comes to mind is that you can optimize lookup by checking current combination against combinations that are already "invalid". I.e. if ab is invalid, than ab.? will be invalid too and there's no point to check such.
And one more thing: try using
for i in allCombinations:
if i not in allWordsJoined:
invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but less memory allocations can be a small boost for real data run.

Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations (with some exceptions. If your word doesn't have "a" it's not going to have any other combinations that contain "a". You can use some tree-like data structure for this) to compare with your original set of words.
You shouldn't convert your wordlist to a string, but rather a set. You should get O(N) where N is the length of your combinations.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

How to make binary to hex converter using for loop - Python

Yes, this is homework.
I have the basic idea. I know that basically I need to introduce a for loop and set if's saying if the value is above 9 then it's a, b, c, and so forth. But what I need is to get the for loop to grab the integer and its index number to calculate and go back and forth and then print out the hex. by the way its an 8 bit binary number and has to come out in two digit hex form.
thanks a lot!!

I'm assuming that you have a string containing the binary data.
In Python, you can iterate over all sorts of things, strings included. It becomes as simple as this:
for char in mystring:
pass
And replace pass with your suite (a term meaning a "block" of code). At this point, char will be a single-character string. Nice an straight forward.
For getting the character ordinal, investigate ord (find help for it yourself, it's not hard and it's good practice).
For converting the number to hex, you could use % string formatting with '%x', which will produce a value like '9f', or you could use the hex function, which will produce a value like '0x9f'; there are other ways, too.
If you can't figure any thing out, ask; but try to work it out first. It's your homework. :-)

So assuming that you've got the binary number in a string, you will want to have an index variable that gets incremented with each iteration of the for loop. I'm not going to give you the exact code, but consider this:
Python's for loop is designed to set the index variable (for index in list) to each value of a list of values.
You can use the range function to generate a list of numbers (say, from 0 to 7).
You can get the character at a given index in a string by using e.g. binary[index].

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to work with very long strings in Python? - python

Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.

You could treat D as a byte stream file. Something like:- seedfile = open('D1.txt', 'w'); seedfile.write("Fa"); seedfile.close(); n = 0 while (n warning totally untested

Related

How to call another function's results

How can I get Regex to remove redundancies and call itself again?

Finding the end of a contiguous substring of a string without iteration or RegEx

Most efficient way to check if any substrings in list are in another list of strings

How to make binary to hex converter using for loop - Python

Categories

Resources