Split a string and capture all instances in python regex


Newbie here. I have been trying to learn regex for some time, but sometimes I feel I don't understand how regex handles strings: in the planning phase I seem to work it out, but the implementation doesn't behave as I expect.
Here is my little problem: I have strings that contain one or more names (team names). The problem is that if a string contains more than one name, there is no separator. All names are joined directly.
Some examples :
String                      Contains    Names to be extracted
'RangersIslandersDevils'    3 names     [Rangers, Islanders, Devils]
'49ersRaiders'              2 names     [49ers, Raiders]
'Avalanche'                 1 name      [Avalanche]
'Red Wings'                 1 name      [Red Wings]
I want to capture each name in each string and use them in a loop later on. But I can't seem to implement the pattern I imagine for it.
The pattern I have in mind for these strings works like this:
1. Start scanning the text, which is expected to start with a capital letter or a number.
2. If you see a literal 's' followed by a capital letter (like ...s[A-Z]...), capture the text up to and including the 's'.
3. Repeat step two until you no longer see the ...s[A-Z]... pattern, then capture the rest of the string as the last name.
4. Optionally, write all names to a list.
Well, I tried some code in vain: step two captures only one instance, and step three gives just one more.
re.findall('([A-Z0-9].*s)*([A-Z].*)+', 'RangersIslandersMolsDevil')
That returns only two names:
[('RangersIslandersMols', 'Devil')]
whereas I want four:
[Rangers, Islanders, Mols, Devil]

In ([A-Z0-9].*s)* the .* is greedy: it matches as many characters as it can and only backs off as far as the last 's' in the string, so 'RangersIslandersMols' gets stuck together as one match.
It sounds like the boundary between team names is defined as a lowercase letter (not necessarily an 's', as in 'Avalanche') followed immediately by an uppercase letter or number, so our pattern should look for:
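To see the greediness in isolation, here is a small sketch using only the first part of the asker's pattern. It also shows that simply switching to the lazy `.*?` is not enough, because 'Islanders' contains an early 's'. The variable names are just for illustration.

```python
import re

s = 'RangersIslandersMolsDevil'

# Greedy: '.*' runs to the end of the string, then backtracks to the LAST 's'.
greedy = re.match(r'[A-Z0-9].*s', s).group(0)

# Lazy: '.*?' stops at the FIRST 's', which is too early for 'Islanders'.
lazy = re.findall(r'[A-Z0-9].*?s', s)

print(greedy)  # RangersIslandersMols
print(lazy)    # ['Rangers', 'Is', 'Mols']
```

So the split point is better described by letter case than by the letter 's'.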
uppercase letters or numbers, followed by
lowercase letters
Because a team name can have multiple words, we'll also look for a space followed by the same pattern as above, for any possible number of words.
Try this pattern:
>>> import re
>>> pattern = r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*'
>>> re.findall(pattern, "RangersIslandersDevils49ersWashington Football TeamAvalancheWarriors")
['Rangers', 'Islanders', 'Devils', '49ers', 'Washington Football Team', 'Avalanche', 'Warriors']
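As a rough sketch, the same pattern can be wrapped in a helper and run against the four strings from the question. The names `TEAM_NAME` and `split_team_names` are made up for this example, and the pattern still assumes every word starts with capitals or digits followed by lowercase letters.

```python
import re

# One or more capitals/digits, then lowercase letters, optionally followed by
# further space-separated words of the same shape.
TEAM_NAME = re.compile(r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*')

def split_team_names(joined):
    """Return the team names found in a string of concatenated names."""
    return TEAM_NAME.findall(joined)

for sample in ['RangersIslandersDevils', '49ersRaiders', 'Avalanche', 'Red Wings']:
    print(sample, '->', split_team_names(sample))
```

The returned list can be used directly in a later loop, which is what the question asks for.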

Related

Regex - finding a string that occurs in another string every n (fixed) characters

My question title is probably unclear, so I'm going to try to explain with a few examples.
I have only just started using regex, and would like to use it to match a string that occurs (hidden) within another string every n characters, for example:
We want to find the word "hello" in this string:
"h..e..l..l..o......"
Note how each letter is separated from the next by 2 characters
I was able to create a simple regex rule for this as follows:
"h.{2}e.{2}l.{2}l.{2}o"
However, this only works for a gap of exactly 2 characters. I wanted it to work for any gap size (or at least, say, up to 5), so that it would also match a string like:
"h...e...l...l...o"
What I did try:
"h.{1,5}e.{1,5}l.{1,5}l.{1,5}o"
However, this does not force the same gap size across all the ranges, so this string would pass even though the letters don't appear at fixed intervals:
"h..e..l.l..o"
Notice the single character between the 2 l's. My only other thought here would be to use a loop that iterates from 1 to 5 and tries each gap size as an individual regex, but I thought it would be useful to ask here to see if there is a neat way to do this with regex :)
Thank you for any responses

I suggest using a capture group and back-reference for this:
h(\.{1,5})e\1l\1l\1o
Here (\.{1,5}) after the first letter h matches 1 to 5 literal dots and captures them in group 1. The back-reference \1 between the remaining letters then requires exactly the same text again, which guarantees the same number of dots between all letters.
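A minimal Python sketch of the back-reference approach follows. Note that a back-reference repeats the captured text, not just its length, so h(.{1,5})e\1... would not help when the filler characters differ from gap to gap. For that case the sketch falls back to the loop the asker mentioned, building one pattern per gap size. `DOTTED` and `find_spaced` are names invented for this example.

```python
import re

# Literal dots as filler: \1 must repeat exactly what group 1 captured.
DOTTED = re.compile(r'h(\.{1,5})e\1l\1l\1o')

def find_spaced(word, text, max_gap=5):
    """Return the gap size at which `word` appears in `text`, or None.

    Any character may act as filler, so one pattern is built per gap size.
    """
    for gap in range(1, max_gap + 1):
        pattern = ('.{%d}' % gap).join(re.escape(ch) for ch in word)
        if re.search(pattern, text):
            return gap
    return None

print(bool(DOTTED.search('h...e...l...l...o')))  # True
print(bool(DOTTED.search('h..e..l.l..o')))       # False
print(find_spaced('hello', 'hxxeyyl12l..o'))     # 2
print(find_spaced('hello', 'h..e..l.l..o'))      # None
```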

Try ''.join(re.findall(r"[a-zA-Z]",text)) if you only want to extract the letters from the text and the hidden text is a single word.

Python Regular expression search specific string beside number

I need help here.
I have a list and string.
What I want to do is find all the numbers in the string and also match the words from the list that appear beside numbers in the string.
str = '''Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International.'''
list = ['school', 'international', 'house', 'flat no']
I wrote a regex which can pull numbers
x = re.findall(r'([0-9]+[\S]+[0-9]+|[0-9]+)', str, re.I | re.M)
Output I want:
Numbers - ['9:00', '203', '14th']
Flat No.203 (because flat no is beside 203)
14 is also beside a word, but I don't want it because that word is not in the list.
But how can I write a regex that satisfies the second condition, that is, one that checks
whether flat no is beside 203 or not in the same regex?

There you go:
(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})
What does it do and how does it work?
I use two pipes (|) to combine the three different number "types" you want:
First alternative (\d{1,2}:\d{1,2}) - captures the time using 1-2 digits followed by a colon and another set of 1-2 digits (you could probably require exactly 2 digits after the colon).
Second alternative (?:No\. (\d+)) - matches the literal prefix "No. " (note the space at the end) without capturing it, and then captures the number that follows, no matter how long (at least one digit).
The third and last alternative (\d+\w{2}) - simply captures any number of digits (again, at least one) followed by two word characters. You could further improve this part of the regex to match only the st, nd, rd, and th suffixes, but I will leave this up to you.
Also to get rid of further unneeded matches you could use lookarounds, but again - I'll leave this up to you to implement.
General note - rather than using one regex to rule... erm - match them all, you should focus on creating many simple regexes. Not only will this improve legibility, but also maintainability of the regexes. This also allows you to search for timestamps, building numbers and positional numerals separately, easily allowing you to split this information to specific variables.
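The "many simple regexes" advice might look like the sketch below. This is only one possible reading of the question: it treats "beside" as a keyword from the list immediately followed by a number (with an optional dot and whitespace in between). The variable names `text` and `keywords` replace the question's `str` and `list` to avoid shadowing built-ins.

```python
import re

text = ("Lily goes to school everyday at 9:00. Her House is near to her school.\n"
        "Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,\n"
        "washington. Her school name is kids International.")
keywords = ['school', 'international', 'house', 'flat no']

# Times such as 9:00.
times = re.findall(r'\b\d{1,2}:\d{2}\b', text)

# Ordinals such as 14th.
ordinals = re.findall(r'\b\d+(?:st|nd|rd|th)\b', text)

# A keyword from the list, an optional '.', optional whitespace, then a number.
keyword_alt = '|'.join(re.escape(k) for k in keywords)
beside = re.findall(r'\b(%s)\.?\s*(\d+)' % keyword_alt, text, re.I)

print(times)     # ['9:00']
print(ordinals)  # ['14th']
print(beside)    # [('Flat No', '203')]
```

Each result lands in its own variable, so timestamps, building numbers and ordinals never need to be untangled from a single combined match.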

Numeration of groups in regular expression

I am learning regular expressions in Python but couldn't find out what the numbering in .group() is based on.
Here is my code:
import re
string = 'suzi sabin joe brandon josh'
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(0))
# output : suzi sabin joe brandon josh
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(1))
# output : josh
I am wondering
Why is there only group(1) and not group(1-5)?
Why was josh classified into group(1)?
I am thankful for any advice.

When you call group(0), you get the whole matched text, which is the whole string, since your pattern matches from the beginning of the string to the end.
While the regex matches everything, it only captures one name: the pattern contains a single pair of parentheses, so there is only one capture group, group 1 (group 0 is reserved for the whole match, so capture groups are numbered from 1). Because the first .* is greedy (it tries to match as much text as possible), it gobbles up the earlier names, and the captured name is the last one, "josh" (and the last .* matches an empty string). The captured name is what you get when you call group(1).
If you want to separately capture each name, you'll need to do things differently. Probably something like this would work:
print(re.findall(r'\b(suzi|sabin|joe|brandon|josh)\b', string))
This will print the list ['suzi', 'sabin', 'joe', 'brandon', 'josh']. Each name will appear in the output in the same order it appears in the input string, which need not be the same order they were in the pattern. This might not do exactly what you want though, since it will skip over any text that isn't one of the names you're looking for (rather than failing to match anything).
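To make the numbering rule concrete, here is a short sketch. Groups are counted by their opening parenthesis, left to right, and a single group holds only one value per match no matter how many alternatives it lists. The second pattern is an invented example with nested groups.

```python
import re

string = 'suzi sabin joe brandon josh'

# One pair of parentheses means one capture group, however many names it lists.
m = re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string)
group_count = m.re.groups
print(group_count)  # 1
print(m.groups())   # ('josh',)

# Groups are numbered by the position of their opening parenthesis.
m2 = re.search(r'(\w+) ((\w+) (\w+))', string)
print(m2.group(1))  # suzi
print(m2.group(2))  # sabin joe
print(m2.group(3))  # sabin
print(m2.group(4))  # joe

# finditer yields one match object per occurrence instead.
names = [n.group(1) for n in re.finditer(r'\b(suzi|sabin|joe|brandon|josh)\b', string)]
print(names)
```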

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions, and I am now trying to solve an exercise that reads as follows:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have the right answer. And even the answer given in the book for this exercise doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$)
Thank you in advance!

The answer in your book seems correct to me. It also works on the test cases you have given.
(^\d{1,3}(,\d{3})*$)
The '^' symbol anchors the match at the start of the line. \d{1,3} requires at least one digit but no more than three before the first comma, so
1234,123
will not match.
(,\d{3})*$
This part allows zero or more groups of one comma followed by exactly three digits, up to the end of the line.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
This drops the three-digit limit on the leading group, so any number of digits may come before the first comma. Note that it therefore also accepts '1234', which the exercise says must not match.
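A quick way to settle whether a pattern "works" is to run it against the five strings from the exercise. The sketch below does that for the book's pattern and for the \d+ variant. The names `book` and `loose` are just labels for this example.

```python
import re

book = re.compile(r'^\d{1,3}(,\d{3})*$')
loose = re.compile(r'^\d+(,\d{3})*$')   # variant with \d+ as the leading group

samples = ['42', '1,234', '6,368,745', '12,34,567', '1234']
book_results = [bool(book.search(s)) for s in samples]
loose_results = [bool(loose.search(s)) for s in samples]

for s, b, l in zip(samples, book_results, loose_results):
    print(repr(s), 'book:', b, 'loose:', l)
```

The book's pattern accepts the first three strings and rejects the last two, as the exercise requires, while the \d+ variant lets '1234' through.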

You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$

I got it to work by putting the stuff between the caret and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless for finding these numbers in a document, because the anchors mean the whole string has to consist of nothing but the number.

# This program validates input against the regular expression for this scenario.
# Any properly formatted number (with commas) will match.
# Parsing through a document for this regex is beyond my capability at this time.
import re

print('Type a number with commas')
sentence = input()
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
# match() returns None when nothing matches at the start, so check that first.
if matches is None or matches.group(0) != sentence:
    # The input does NOT match the pattern in full.
    print('Does Not Match the Regular Expression!')
else:
    # The whole input matched, so confirm it.
    print(matches.group(0) + ' matches the pattern.')

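The simple answer is:
^\d{1,2}(,\d{3})*$
^\d{1,2} - the string must start with 1 or 2 digits.
(,\d{3})*$ - each ',' must be followed by exactly 3 digits, up to the end of the string.
This works for all the scenarios listed in the book, but note that it rejects valid numbers with a three-digit leading group such as '123,456'; use \d{1,3} to accept those.
Test your scenarios on https://pythex.org/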

I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number, that is, there might be multiple such numbers in the same line and there might be some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.
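A shorter alternative, offered here only as a sketch, is to describe what may not surround the number instead of listing what may. The negative lookarounds below reject a candidate that is glued to another digit or to a comma-plus-digit on either side. The pattern name `NUMBER` and the sample sentence are made up for this example, and decimals such as 1,234.56 are not considered.

```python
import re

NUMBER = re.compile(
    r'(?<!\d)(?<!\d,)'   # not preceded by a digit or by "digit,"
    r'\d{1,3}(?:,\d{3})*'
    r'(?!\d|,\d)'        # not followed by a digit or by ",digit"
)

text = "Valid: 42, 1,234 and 6,368,745. Invalid: 12,34,567 and 1234."
found = NUMBER.findall(text)
print(found)  # ['42', '1,234', '6,368,745']
```

Because only digits and digit-adjacent commas are excluded, ordinary punctuation after a number (as in "42, ") does not block the match.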

Matching an object and a specific regex with Python

Given a text, I need to check for each char whether it has exactly (edited) 3 capital letters on both sides and, if it does, add it to a string of such characters that is returned.
I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text)
(let's say text="AAAbAAAcAAA")
I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"
Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet when invoking m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?
Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?

You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:
([A-Z]{3}).[A-Z]{3}
The exception is m.group(0), which will always contain the entire match.
Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like
([0-9]{3})-([0-9]{3}-[0-9]{4})
then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.

One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
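The sketch below puts the pieces from both answers side by side: `re.match` returning a single match, a lookahead-wrapped pattern for overlapping matches, and one possible way to enforce "exactly three" capitals by forbidding a fourth on either edge. The `EXACT` pattern is an assumption about what the asker means by "exactly" and is not taken from either answer.

```python
import re

text = 'AAAbAAAcAAA'

# re.match returns only the first match, anchored at the start of the string.
first = re.match(r'[A-Z]{3}.[A-Z]{3}', text).group(0)
print(first)  # AAAbAAA

# Wrapping the whole pattern in a lookahead consumes nothing, so matches may overlap.
overlapping = re.findall(r'(?=([A-Z]{3}.[A-Z]{3}))', text)
print(overlapping)  # ['AAAbAAA', 'AAAcAAA']

# "Exactly three" capitals on each side: forbid a fourth capital on either edge.
EXACT = re.compile(r'(?<![A-Z])[A-Z]{3}([a-z])(?=[A-Z]{3}(?![A-Z]))')
result = ''.join(EXACT.findall(text))
print(result)  # bc
```

With four capitals in front, as in 'AAAAbAAA', `EXACT` finds nothing, whereas the pattern without the negative lookarounds would still return 'b'.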
