How can I get Regex to remove redundancies and call itself again?

I have a simple function which when given an input like (x,y), it will return {{x},{x,y}}.
In the cases that x=y, it naturally returns {{x},{x,x}}.
I can't figure out how to get Regex to substitute 'x' in place of 'x,x'. But even if I could figure out how to do this, the expression would go from {{x},{x,x}} to {{x},{x}}, which itself would need to be substituted for {{x}}.
The closest I have gotten has been:
re.sub('([0-9]+),([0-9]+)',r'\1',string)
But this function will also turn {{x},{x,y}} into {{x},{x}}, which is not desired. You may also notice that the function searches for numbers only, which is fine because I really only intend to use numbers in place of x and y; however, if there is a way to get it to work with any letter as well (lower case or capital), that would be even better.
Note also that if I give my original function (x,y,z) it will read it as ((x,y),z) and thus return {{{{x},{x,y}}},{{{x},{x,y}},z}}, thus in the case that x=y=z, I would want to be able to have a Regex function call itself repeatedly to reduce this to {{{{x}}},{{{x}},x}} instead of {{{{x},{x,x}}},{{{x},{x,x}},x}}.
If it helps at all, this is essentially an attempt at making a translation (into sets) using the Kuratowski definition of an ordered pair.

Essentially to solve this you need recursion, or more simply, keep applying the regex in a loop until the replacement doesn't change the input string. For example using your regex from https://regex101.com/r/Yl1IJv/4:
import re

s = '{{ab},{ab,ab}}'
while True:
    # Collapse every "X,X" pair into "X"; repeat until nothing changes.
    news = re.sub(r'(?P<first>.?(\w+|\d+).?),(?P=first)', r'\g<1>', s)
    if news == s:
        break
    s = news
print(s)
Output
{{ab}}
With
s = '{{{{x},{x,x}}},{{{x},{x,x}},x}}'
The output is
{{{{x}}},{{{x}},x}}
as required.
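The loop above can be wrapped in a reusable function. This is only a sketch: the name collapse_duplicates is made up for illustration, and the pattern is the one quoted in the answer, not a hardened one.

```python
import re

# Pattern from the answer: a token (optionally wrapped in one character on
# each side), a comma, then exactly the same text again.
DUPLICATE = re.compile(r'(?P<first>.?(\w+|\d+).?),(?P=first)')

def collapse_duplicates(text):
    """Apply the substitution repeatedly until the string stops changing."""
    while True:
        reduced = DUPLICATE.sub(r'\g<first>', text)
        if reduced == text:
            return text
        text = reduced

print(collapse_duplicates('{{x},{x,x}}'))   # {{x}}
print(collapse_duplicates('{{x},{x,y}}'))   # {{x},{x,y}}
```

One caveat worth checking for numeric input: the backreference only has to match the start of what follows the comma, so with this pattern '{{1},{1,12}}' is also shortened, to '{{1},{12}}'. If multi-digit numbers matter, the pattern would need an extra guard after the backreference.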

Related

How to call another function's results

def most_frequency_occ(chars,inputString):
    count = 0
    for ind_char in inputString:
        ind_char = ind_char.lower()
        if chars == ind_char:
            count += 1
    return count

def general(inputString):
    maxOccurences = 0
    for chars in inputString:
        most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequently occurring letter in general. I wrote another function, most_frequency_occ, that counts how often a specific character occurs in the string, but how do I generalize this to find the most frequent letter without specifying a character, using only loops and no built-in string functions?
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.

If I understand your task correctly, a dictionary will be more convenient.
# initializing string
text = "Hello world"
# initializing dict of frequencies
freq = {}
for ch in text:
    if ch in freq:
        freq[ch] += 1
    else:
        freq[ch] = 1
# Now you have the count of every character in the string.
# To extract the highest count:
max_freq = max(freq.values())

There are multiple ways to find the most common letter in a string.
One easy-to-understand approach that works in any language:
Initialize an array of 26 integers, all set to 0.
Go over the string one letter at a time and increment the array slot for that letter; for example, if the letter is a B (the 2nd letter), increment the second value of the array.
Find the largest value in the array and return the corresponding letter.
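The steps above can be sketched in Python as follows. The function name most_common_letter is made up for illustration, and the sketch assumes only the letters a to z should be counted, case-insensitively.

```python
def most_common_letter(text):
    counts = [0] * 26                   # one slot per letter a..z
    for ch in text.lower():
        index = ord(ch) - ord('a')      # 'a' -> 0, 'b' -> 1, ...
        if 0 <= index < 26:             # skip anything that is not a-z
            counts[index] += 1
    best = 0
    for i in range(1, 26):              # find the slot with the largest count
        if counts[i] > counts[best]:
            best = i
    return chr(ord('a') + best), counts[best]

print(most_common_letter('aqweasdaza'))  # ('a', 4)
```

On a tie this returns the alphabetically first letter, and for a string without letters it returns ('a', 0); both are choices of this sketch and not requirements from the question.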
Since you are using python, you could use dictionaries since it would be less work to implement.
A word of caution, it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy paste code from the internet.

The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can compare the count with the best count seen so far to check whether you've just found the most common letter up to this point:
maxOccurences = 0
for chars in inputString:
    count = most_frequency_occ(chars, inputString)
    if count > maxOccurences: # check if chars is more common than the previous best
        maxOccurences = count
return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.
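Putting the pieces together with the suggested names might look like the sketch below. The names count_char and find_most_frequent_character come from the renaming advice above; lowercasing both sides of the comparison is an assumption added here so that uppercase input is counted too.

```python
def count_char(char, text):
    """Count how many times `char` occurs in `text`, ignoring case."""
    count = 0
    for current in text:
        if current.lower() == char:
            count += 1
    return count

def find_most_frequent_character(text):
    """Return the highest occurrence count of any single character."""
    max_occurrences = 0
    for char in text:
        count = count_char(char.lower(), text)
        if count > max_occurrences:   # more common than the previous best
            max_occurrences = count
    return max_occurrences

print(find_most_frequent_character('aqweasdaza'))  # 4
```

As in the question, the function returns the count (4), not the letter itself; tracking the letter would need one more variable updated alongside max_occurrences.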

Python regex matching words with repeating consonant

First off, this is homework. (I couldn't use a tag in the title and nothing showed up in the tag list at the bottom for homework, so please let me know if I should EDIT something else regarding this matter).
So I have been reading through the python docs and scavenging SO, finding several solutions that are close to what I want, but not exact.
I have a dictionary which I read in to a string:
a
aa
aabbaa
...
z
We are practicing various regex patterns on this data.
The specific problem here is to return a list of words which match the pattern, NOT tuples with the groups within each match.
For example:
Given a subset of this dictionary like:
someword
sommmmmeword
someworddddd
sooooomeword
I want to return:
['sommmmmeword', 'someworddddd']
NOT:
[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant
EDIT:
My reasoning behind the above example, is that I want to see how I can avoid making a second pass over the results. That is instead of saying:
res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]
I specifically want a mechanism where I can just use:
return re.match(re.compile(r'pattern'), dictionary)
I know that may sound silly, but I am doing this to really dig into regex. I mention this at the bottom.
This is what I have tried:
# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing
# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing
Somewhere in testing I noticed a pattern returning things like '\nsomeword'. I couldn't figure out what it was but if I find the pattern again I will include it here for completeness.
# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing
# Okay lets just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope
# Since its words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope
# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)' # I realize set difference may be the issue
# I saw someone (on SO) use set difference claiming it works
# but I gave up on it...
# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]
# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # returned everything (all words)
# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # still everything
# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' -> # still everything
# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'
# Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
# Using dictionary -> [] # WAIT WHAT?!
How does this last one work? Maybe there are no words with 3+ repeating consonants? I'm using /usr/share/dict/cracklib-small on my school's server, which is about 50,000 words I think.
I am still working on this but any advice would be awesome.
One thing I find curious is that you can not back reference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capture, but then I can not back reference. Obviously I could leave the captures, loop over the results and filter out the extra stuff, but I absolutely want to figure this out using ONLY regex!
Perhaps there is a way to do the non-capture, but still allow back reference? Or maybe there is an entirely different expression I haven't tested yet.

Here are some points to consider:
Use re.findall to get all the results, not re.match (that only searches for 1 match and only at the string start).
[b-z&&[^eiou]] is Java/ICU regex syntax and is not supported by Python's re module. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in tuples with re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.
Coming back to the question, how you can use a backreference and still get the whole match, here is a working demo:
import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
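The lookahead form mentioned in the points above, (?![eiou])[b-z], can be dropped into the same demo in place of the hand-written ranges. A minimal sketch, assuming lowercase input as in the question; the variable names are illustrative:

```python
import re

words = """someword
sommmmmeword
someworddddd
sooooomeword"""

# (?![eiou])[b-z] matches one letter from b to z that is not a vowel;
# \1 then requires the same consonant to appear again immediately.
doubled_consonant = re.compile(r"\w*((?![eiou])[b-z])\1\w*")

matches = [m.group() for m in doubled_consonant.finditer(words)]
print(matches)  # ['sommmmmeword', 'someworddddd']
```

Because .group() with no argument returns the whole match, the capturing group needed for the backreference never shows up in the result list.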

Python- find substring and then replace all characters within it

Let's say I have this string :
<div>Object</div><img src=#/><p> In order to be successful...</p>
I want to substitute every letter between < and > with a #.
So, after some operation, I want my string to look like:
<###>Object<####><##########><#> In order to be successful...<##>
Notice that every character between the two symbols were replaced with # ( including whitespace).
This is the closest I could get:
r = re.sub('<.*?>', '<#>', string)
The problem with my code is that all characters between < and > are replaced by a single #, whereas I would like every individual character to be replaced by a #.
I tried a mixture of various back references, but to no avail. Could someone point me in the right direction?

What about...:
def hashes(mo):
    replacing = mo.group(1)
    return '<{}>'.format('#' * len(replacing))
and then
r = re.sub(r'<(.*?)>', hashes, string)
The ability to use a function as the second argument to re.sub gives you huge flexibility in building up your substitutions (and, as usual, a named def results in much more readable code than any cramped lambda -- you can use meaningful names, normal layouts, etc, etc).

The re.sub function can be called with a function as the replacement, rather than a new string. Each time the pattern is matched, the function will be called with a match object, just like you'd get using re.search or re.finditer.
So try this:
re.sub(r'<(.*?)>', lambda m: "<{}>".format("#" * len(m.group(1))), string)
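For completeness, here is a self-contained run on the string from the question. The function name mask_tag and the variable names are illustrative only; the technique is the same callable-replacement form of re.sub described above.

```python
import re

def mask_tag(match):
    """Replace everything between < and > with the same number of '#'."""
    inner = match.group(1)
    return "<" + "#" * len(inner) + ">"

html = "<div>Object</div><img src=#/><p> In order to be successful...</p>"
masked = re.sub(r"<(.*?)>", mask_tag, html)
print(masked)
```

Since every character inside the brackets is swapped for exactly one #, the result has the same length as the input, which is a quick sanity check for this kind of masking.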

Simple regular expression not working

I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple: it is the typical situation where a user enters a range of pages or single pages. I am reading the string and checking whether it is correct.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And would it be possible to merge both into one single regular expression, so that if the user enters either something like 1-2, 7-10 or something like 3,5,6,7, the expression is recognized as valid?

Simpler is better
Matching the entire input with one regex isn't simple, as the other proposed solutions show; at least, it is not as simple as it could be. Such a pattern quickly becomes effectively read-only, and anyone who isn't regex-savvy will probably scrap it for a simpler, more explicit solution the first time they need to modify it.
Simplest
First split the entire string on "," into individual entries; you need them separately anyway to extract the usable numbers.
Then the test for each entry becomes very simple:
^(\d+)(?:-(\d+))?$
It says that the string must start with one or more digits, optionally followed by a single - and one or more digits, and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you are going to need the input parsed out to actually use it anyway, this way you get the numbers if they match without having to add more code to parse them again anyway.
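A sketch of this split-then-validate approach in Python, assuming the per-entry pattern is meant to be ^(\d+)(?:-(\d+))?$. The function name parse_pages, the whitespace stripping, and the choice to report a single page as a one-page range are assumptions made for illustration.

```python
import re

ENTRY = re.compile(r"^(\d+)(?:-(\d+))?$")

def parse_pages(spec):
    """Return a list of (first, last) page tuples, or raise ValueError."""
    ranges = []
    for raw in spec.split(","):
        entry = raw.strip()              # tolerate "1-3, 5-6" style spacing
        match = ENTRY.match(entry)
        if match is None:
            # We know exactly which entry is wrong and can report it.
            raise ValueError(f"invalid page entry: {entry!r}")
        first = int(match.group(1))
        # A single page is treated as a one-page range.
        last = int(match.group(2)) if match.group(2) else first
        ranges.append((first, last))
    return ranges

print(parse_pages("1-3, 5-6, 12-67"))  # [(1, 3), (5, 6), (12, 67)]
print(parse_pages("1,5,6"))            # [(1, 1), (5, 5), (6, 6)]
```

The error message names the offending entry, which is the benefit described above: the caller learns which part of the input failed, not just that the whole string did not match.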

This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Testing this -
>>> import re
>>> test_vals = [
...     '1-3, 5-6, 12-67',
...     '1,5,6,9,10,12',
...     '1-3,1,2,4.5',
...     'abcd',
... ]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
...     print(val)
...     if regex.match(val) is None:
...         print("Fail")
...     else:
...         print("Pass")
...
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail

How to work with very long strings in Python?

I'm tackling Project Euler's problem 220 (looked easy, in comparison to some of the others - thought I'd try a higher numbered one for a change!)
So far I have:
D = "Fa"

def iterate(D, num):
    for i in range(0, num):
        # Use temporary upper-case markers so both rules apply simultaneously.
        D = D.replace("a", "A")
        D = D.replace("b", "B")
        D = D.replace("A", "aRbFR")
        D = D.replace("B", "LFaLb")
    return D

instructions = iterate("Fa", 50)
print(instructions)
Now, this works fine for low values, but when you raise the number of iterations you just get a "Memory error". Can anyone suggest a way to overcome this? I really want a string/file that contains the instructions for the next step.

The trick is in noticing which patterns emerge as you run the string through each iteration. Try evaluating iterate(D,n) for n between 1 and 10 and see if you can spot them. Also feed the string through a function that calculates the end position and the number of steps, and look for patterns there too.
You can then use this knowledge to simplify the algorithm to something that doesn't use these strings at all.

Python strings are not going to be the answer to this one. Strings are stored as immutable arrays, so each one of those replacements creates an entirely new string in memory. Not to mention, the set of instructions after 10^12 steps will be at least 1TB in size if you store them as characters (and that's with some minor compressions).
Ideally, there should be a way to mathematically (hint, there is) generate the answer on the fly, so that you never need to store the sequence.
Just use the string as a guide to determine a method which creates your path.

If you think about how many "a" and "b" characters there are in D(0), D(1), etc, you'll see that the string gets very long very quickly. Calculate how many characters there are in D(50), and then maybe think again about where you would store that much data. I make it 4.5*10^15 characters, which is 4500 TB at one byte per char.
Come to think of it, you don't have to calculate - the problem tells you there are 10^12 steps at least, which is a terabyte of data at one byte per character, or quarter of that if you use tricks to get down to 2 bits per character. I think this would cause problems with the one-minute time limit on any kind of storage medium I have access to :-)
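The size estimate can be checked without ever building the string. A minimal sketch, assuming the rewrite rules from the question (each "a" or "b" becomes five characters containing one "a", one "b" and three others); the function name dragon_length is made up for illustration.

```python
def dragon_length(n):
    """Length of D(n), starting from D(0) = "Fa", without building the string."""
    rewritable = 1   # number of 'a' and 'b' characters in D(0)
    fixed = 1        # number of other characters ('F', 'L', 'R') in D(0)
    for _ in range(n):
        # Each 'a' or 'b' leaves behind three fixed characters
        # and turns into one 'a' plus one 'b'.
        fixed += 3 * rewritable
        rewritable *= 2
    return rewritable + fixed

print(dragon_length(50))  # 4503599627370494
```

That is 2^52 - 2, roughly 4.5 * 10^15 characters, in line with the figure quoted above; in general the recurrence works out to 2^(n+2) - 2.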

Since you can't materialize the string, you must generate it. If you yield the individual characters instead of returning the whole string, you might get it to work.
def repl220(string):
    for c in string:
        if c == 'a':
            yield "aRbFR"
        elif c == 'b':
            yield "LFaLb"
        else:
            yield c
Something like that will do replacement without creating a new string.
Now, of course, you need to call it recursively, and to the appropriate depth. So, each yield isn't just a yield, it's something a bit more complex.
Trying not to solve this for you, so I'll leave it at that.
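One possible shape for the recursive version is sketched below. It is not a solution to the Euler problem itself (walking 10^12 characters one by one is still far too slow); it only shows how nested generators avoid holding the expanded string in memory. The names RULES and expand are illustrative.

```python
from itertools import islice

RULES = {"a": "aRbFR", "b": "LFaLb"}

def expand(symbols, depth):
    """Lazily yield the characters of `symbols` rewritten `depth` times."""
    for c in symbols:
        if depth > 0 and c in RULES:
            # Recurse into the replacement instead of materializing it.
            yield from expand(RULES[c], depth - 1)
        else:
            yield c

# First 14 instructions of D(50); only about 50 generator frames are alive at once.
print("".join(islice(expand("Fa", 50), 14)))  # FaRbFRRLFaLbFR
```

Because the replacement for "a" itself starts with "a", every D(n) is a prefix of D(n+1), so the first few instructions of D(50) are the same as those of D(2).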

Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.

You could treat D as a byte stream file.
Something like:-
seedfile = open('D1.txt', 'w');
seedfile.write("Fa");
seedfile.close();
n = 0
while (n
warning totally untested
