Regex: how to use re.sub with variable number of elements? - python

I'm trying to replace {x;y} patterns in a text corpus with "x or y", except that the number of elements is variable, so sometimes there will be 3 or more elements i.e. {x;y;z} (max is 9).
I'm trying to do this with regex, but I'm not sure how to do this such that I can replace according to the number of elements present. So I mean like, if I use a regex with a variable component like the following
part = '(;[\w\s]+)'
regex = '\(([\w\s]+);([\w\s]+){}?\)'.format(part)
re.sub(regex, r'\1 or \2 or \3', text)
I will sometimes get an additional 'or' (and more if I increase the number of variable elements) when there are only 2 elements present in the braces, which I don't want. The alternative is to do this many times with different number of variable parts but the code would be very clunky. I'm wondering if there are any ways I could achieve this with regex methods? Would appreciate any ideas.
I'm using python3.5 with spyder.

The scenario is just a bit too much for a regular search-and-replace action, so I would recommend passing in a function to dynamically generate the replacement string.
import re
text = 'There goes my {cat;dog} playing in the {street;garden}.'
def replacer(m):
    return m.group(1).replace(';', ' or ')
output = re.sub(r'\{((\w;?)*\w)\}', replacer, text)
print(output)
Output:
There goes my cat or dog playing in the street or garden.
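The pattern above only accepts word characters, while the question's own attempt used [\w\s]+ for each element. A minimal sketch of a whitespace-tolerant variant follows; the names BRACE_GROUP, join_with_or and expand_alternatives are hypothetical, and the sketch assumes elements never contain braces or semicolons themselves.

```python
import re

# Hypothetical pattern: one brace group holding two or more ';'-separated items.
# [^{};]+ accepts spaces and punctuation inside an item, unlike \w.
BRACE_GROUP = re.compile(r'\{([^{};]+(?:;[^{};]+)+)\}')

def join_with_or(match):
    # Split the captured text on ';' and trim stray spaces around each item.
    items = [item.strip() for item in match.group(1).split(';')]
    return ' or '.join(items)

def expand_alternatives(text):
    # re.sub calls join_with_or once per brace group, whatever its item count.
    return BRACE_GROUP.sub(join_with_or, text)

sample = 'Bring {a hat; sunscreen;water} to the {beach;park}.'
print(expand_alternatives(sample))
# Bring a hat or sunscreen or water to the beach or park.
```

Because the replacement is built from however many items the match contains, no extra "or" appears for two-element groups, and a group with a single element such as {x} is left untouched.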

Related

How to separate user's input with two separators? And controlling the users input

I want to separate the users input using two different separators which are ":" and ";"
For example, the user should input 4 subjects and their amounts. The format should be:
(Subject:amount;Subject:amount;Subject:amount;Subject:amount)
If the input is wrong, it should print "Invalid Input".
Here's my code, but I can only use one separator. How can I control the user's input?
B = input("Enter 4 subjects and amount separated by (;) like Math:90;Science:80:").split(";")
Please help. I can't figure it out.
If you are fine with using regular expressions in python you could use the following code:
import re
output_list = re.split("[;:]", input_string)
Inside the square brackets you list every character (also known as a delimiter) that you want to split on. Keep the quotes around the square brackets: the whole thing is a regex string, which tells re.split where to split.
Further reading on regex can be found here if you feel like it: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
However, if you want to do it without importing anything, here is another possible solution (I would recommend against it, but it gets the job done):
input_string = input_string.replace(";", ":")
output_list = input_string.split(":")
This works by first replacing all of the semicolons in the input string with colons (it would also work the other way around) and then splitting on the remaining character (in this case the colon).
Hope this helped, as it is my first answer on Stack Overflow.
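Neither snippet above checks the format, which the question also asked for. A minimal sketch of one way to validate it follows, assuming amounts are non-negative whole numbers and exactly four Subject:amount pairs are expected; parse_subjects and its expected parameter are hypothetical names, not part of the original answer.

```python
def parse_subjects(raw, expected=4):
    """Return a list of (subject, amount) pairs, or None if the input is malformed."""
    pairs = []
    for entry in raw.strip().split(';'):
        # partition splits on the first ':' only, so 'Math:90:5' is rejected below.
        subject, sep, amount = entry.partition(':')
        subject, amount = subject.strip(), amount.strip()
        if not sep or not subject or not amount.isdecimal():
            return None
        pairs.append((subject, int(amount)))
    # Reject inputs with too few or too many pairs.
    return pairs if len(pairs) == expected else None

result = parse_subjects('Math:90;Science:80;Art:70;PE:60')
print(result if result is not None else 'Invalid Input')
# [('Math', 90), ('Science', 80), ('Art', 70), ('PE', 60)]
```

In the real program the literal string would be replaced by the value returned from input(). Splitting on ";" first and ":" second keeps each pair together, which a flat re.split("[;:]", ...) does not.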

How does Python Re deal with a substring (multiple words) passed as a pattern?

I am building a naive String-Matching algorithm with a source to search through. The algorithm uses re.finditer, and it works fine when supplying a single string like "crow" as a pattern, however, I get weird results that do not match the pattern when I try to search for a phrase, like "you know nothing, jon snow", it either does not find it, or finds something else, completely unrelated. That leads me to believe that there is a problem with passing a straight string as a pattern, probably related to whitespaces.
Code:
import re

def naiveMatch(pattern, text):
    print(pattern)
    i = 0
    # Print the start offset of every match and count them.
    for match in re.finditer(pattern, text):
        print(match.start())
        i = i + 1
    print("I: ", i)
As a basis for this test, I am using GOT dialogue. The word "crow" matches 18 times in S3E2. The string "You know nothing, Jon Snow" matches in Episode 2 as well, even though it shouldn't be there. The position it matches is:
'r? This child? TYRION: I '
In Episode 7, the string should be matched, but isn't.
This leads me to believe there is some issue with supplying this string without processing and giving it to the regex.
Update: this problem was actually caused by the order of the files returned by os.listdir(), which put file 10 between files 1 and 2.
['e1.txt', 'e10.txt', 'e2.txt', 'e3.txt', 'e4.txt', 'e5.txt', 'e6.txt', 'e7.txt', 'e8.txt', 'e9.txt']
The problem was fixed by passing sorted() a lambda key that sorts by the number in each filename:
key=lambda a: int(re.search("[0-9][0-9]?[0-9]?", a, flags=0).group(0))
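As a self-contained sketch of that fix, the snippet below sorts the file list shown above by its embedded number. The helper name episode_number is hypothetical, and placing names without any digits first is an assumption made only so the sketch never raises.

```python
import re

def episode_number(filename):
    # Use the first run of digits in the name as the sort key.
    match = re.search(r'\d+', filename)
    # Assumption: names without digits sort before all numbered files.
    return int(match.group(0)) if match else -1

names = ['e1.txt', 'e10.txt', 'e2.txt', 'e3.txt', 'e4.txt',
         'e5.txt', 'e6.txt', 'e7.txt', 'e8.txt', 'e9.txt']
ordered = sorted(names, key=episode_number)
print(ordered)
# ['e1.txt', 'e2.txt', 'e3.txt', 'e4.txt', 'e5.txt', 'e6.txt', 'e7.txt', 'e8.txt', 'e9.txt', 'e10.txt']
```

Plain string sorting compares character by character, which is why 'e10.txt' lands before 'e2.txt'; converting the digits to int makes the comparison numeric.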

Indexing the wrong character for an expression

My program seems to be indexing the wrong character or not at all.
I wrote a basic calculator that allows expressions to be used. It works by having the user enter the expression, then turning it into a list, and indexing the first number at position 0 and then using try/except statements to index number2 and the operator. All this is in a while loop that is finished when the user enters done at the prompt.
The program seems to work fine if I type the expression like this "1+1" but if I add spaces "1 + 1" it cannot index it or it ends up indexing the operator if I do "1+1" followed by "1 + 1".
I have asked in a group chat before and someone told me to use tokenization instead of my method, but I want to understand why my program is not running properly before moving on to something else.
Here is my code:
https://hastebin.com/umabukotab.py
Thank you!
Strings are basically sequences of characters. 1+1 contains three characters, whereas 1 + 1 contains five because of the two added spaces. Thus, when you access the third character of the longer string, you get the operator in the middle rather than the second number.
Parsing input is often not easy, and parsing arithmetic expressions in particular can get tricky quite quickly. Removing spaces from the input, as suggested by @Sethroph, is a viable solution, but it only goes so far. If you suddenly need to support input like 1+2+3, it will still break.
Another solution would be to split your input on the operator. For example:
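expression = '1 + 2'  # avoid naming this "input", which would shadow the built-in
terms = expression.split('+')  # ['1 ', ' 2'] note the spaces
terms = [int(term) for term in terms]  # [1, 2] since int() can handle leading/trailing whitespace
output = sum(terms)  # 3; sum() also covers inputs such as '1 + 2 + 3'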
Still, although this can handle situations like 1 + 2 + 3, it will still break when there's multiple different operators involved, or there are parentheses (but that might be something you need not worry about, depending on how complex you want your calculator to be).
IMO, a better approach would indeed be to use tokenization. Personally, I'd use parser combinators, but that may be a bit overkill. For reference, here's an example calculator whose input is parsed using parsy, a parser combinator library for Python.
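To illustrate what tokenization means here without pulling in a parser library, the following is a rough sketch using only the re module. It is not the parsy calculator mentioned above: it assumes non-negative integers with + and - only, and the names TOKEN, tokenize and evaluate are hypothetical.

```python
import re

# One token is a run of digits or a single + or -, with optional leading spaces.
TOKEN = re.compile(r'\s*(\d+|[+-])')

def tokenize(expression):
    expression = expression.rstrip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError('unexpected input near position {}'.format(pos))
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(expression):
    tokens = tokenize(expression)
    # A valid token list alternates: number, operator, number, ...
    if len(tokens) % 2 == 0 or not tokens[0].isdigit():
        raise ValueError('malformed expression')
    total = int(tokens[0])
    for op, number in zip(tokens[1::2], tokens[2::2]):
        if op not in '+-' or not number.isdigit():
            raise ValueError('malformed expression')
        total = total + int(number) if op == '+' else total - int(number)
    return total

print(tokenize('1 + 1'))       # ['1', '+', '1']
print(evaluate('10 - 3 + 5'))  # 12
```

Since the tokenizer skips whitespace itself, "1+1" and "1 + 1" produce the same tokens, so positions in the raw string no longer matter. Supporting * and / or parentheses would need precedence handling, which is where a real parser becomes worthwhile.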
You could remove the spaces before processing the string by using replace().
Try adding in:
clean_input = hold_input.replace(" ", "")
just after you create hold_input.

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
To get the count, call len() on the list returned by findall. The code will be:
import re
string='hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result=re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)',string)
print("result =", result)  # all the occurrences found, as a list
print("len(result) =", len(result))  # the total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: when the pattern contains capturing groups, findall returns the text captured by the groups rather than the whole match. I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
To get an idea about regular expressions refer python re, tutorials point and to play with the matching characters use this.
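As a variation on the pattern above, here is a small sketch run against the examples from the question. It swaps in a lookbehind for the preceding whitespace so that no capturing group is needed; PATTERN and find_codes are hypothetical names, and adding the count to the existing DataFrame is left out because its layout is not shown.

```python
import re

# (?<=\s) requires whitespace just before the match without including it.
PATTERN = re.compile(r'(?<=\s)\d{1,3}[a-zA-Z]+')

def find_codes(text):
    # With no capturing groups, findall returns the full matches.
    return PATTERN.findall(text)

sample = 'hello 7Out how 99In are 123May'
codes = find_codes(sample)
print(codes)       # ['7Out', '99In', '123May']
print(len(codes))  # 3
```

The lookbehind also means a four-digit run such as " 1234Out" is not matched at all, because the digits must start right after the whitespace and be followed directly by letters.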

Which is the most efficient way of matching and replacing every three newlines with an identifier?

I am working with some .txt files that don't have any structure (they are messy); each file represents a number of pages. To give them some structure, I would like to number the pages, since the files themselves don't include page numbers. This can be done by replacing every three newlines with an annotation like:
\n
page: N
\n
Where N is the page number. This is what my files look like, and I also tried a simple replace. However, that approach gets confused and does not give me the expected format, which would be something like this. Any idea how to replace the whitespace with some kind of identifier, so that I can parse the files and get the position of some information (the page)?
I also tried this:
import re
replaced = re.sub('\b(\s+\t+)\b', '\n\n\n', text)
print (replaced)
If the format is as regular as you state in your problem description:
Replace every occurrence of three newlines \n with page: N
You wouldn't have to use the re module. Something as simple as the following would do the trick:
>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'
I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.
If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:
re.split(r'(?:\n\s*?){3,}', s)
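Putting the two pieces together, a minimal sketch might look like the following. The function name number_pages is hypothetical, and the annotation format is simply the one used in the example above.

```python
import re

def number_pages(text):
    # Split on runs of at least three newlines, allowing other whitespace in between.
    pages = re.split(r'(?:\n\s*?){3,}', text)
    # Re-join the pages, appending a numbered marker after each one.
    return ''.join('{}\n\tpage: {}\n'.format(page, number)
                   for number, page in enumerate(pages, start=1))

messy = 'aaa\n\n\nbbb\n \n\t\nccc'
print(repr(number_pages(messy)))
# 'aaa\n\tpage: 1\nbbb\n\tpage: 2\nccc\n\tpage: 3\n'
```

Passing start=1 to enumerate avoids the i + 1 arithmetic, and the separator pattern treats the clean '\n\n\n' and the messy '\n \n\t\n' breaks the same way.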
