How to extract substring with regex - python

I have SKUs like the following:
SBC225SLB32
SBA2161BRB30
PBA632AS32
The first 3-4 characters are letters A-Z and must be extracted; the following 3-4 characters are digits [0-9] and must also be extracted.
For the first, I tried \D{3,4} and for the second, I tried \d{3,4}.
But when using pandas' .str.extract('\D{3,4}'), I got a "pattern contains no capture groups" error.
Is there a better way to do this?

The regex pattern you pass to Series.str.extract contains no capturing groups, while the method expects at least one.
In your case, it is more convenient to grab both values at once with the help of two capturing groups. You can use
df[['Code1', 'Code2']] = df['SKU'].str.extract(r'^([A-Z]{3,4})([0-9]{3,4})', expand=False)
See the regex demo. Pattern details:
^ - start of string
([A-Z]{3,4}) - Capturing group 1: three to four uppercase ASCII letters
([0-9]{3,4}) - Capturing group 2: three to four ASCII digits.
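As a rough check before handing a pattern to Series.str.extract, the standard re module can report how many capturing groups a pattern has and what each group captures. The sketch below assumes the three SKUs from the question; the variable names are illustrative only.

```python
import re

skus = ["SBC225SLB32", "SBA2161BRB30", "PBA632AS32"]

# The pattern from the question has no parentheses, so it has zero groups.
no_groups = re.compile(r"\D{3,4}")
# The pattern from the answer has one group per value to extract.
two_groups = re.compile(r"^([A-Z]{3,4})([0-9]{3,4})")

group_counts = (no_groups.groups, two_groups.groups)

# Each match yields a (letters, digits) tuple, one tuple per SKU.
extracted = [two_groups.match(sku).groups() for sku in skus]

print(group_counts)
print(extracted)
```

A pattern whose group count is 0 is the kind that triggers the "no capture groups" error in the question.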

Related

Matching two identical groups of characters in pandas with some number of characters in between

I'm trying to extract two numbers of interest from a string of docket text in a pandas dataframe. Here's an example with a couple of the idiosyncrasies that exist in the data
import pandas as pd
df = pd.DataFrame(["Fee: $ 15,732, and Expenses: $1,520.62."])
I used regexr to test some ideas and the closest I've been able to come up with is something along the lines of
df[0].str.extract("(\${0,2}\s*(\d+[,\.]*){1,5})")
Which returns:
0 1
0 $15,732,, 732,,
The problems I'm running into are making characters optional while capturing the groups (i.e. I don't know how to get rid of the inner parenthesis because if I make it brackets then I get an error). And then ideally I'd be able to match the other set of numbers too.
I used regexr and while I can make regular expressions that match what I want, I'm struggling with the grouping part so that I can capture both while not needing to use a cumbersome function like apply with re.
There are sometimes numbers that show up again later in the report that include dates, other numbers, etc... So I'm trying to find a pretty controlled sequence (Can't get too liberal with the .*'s haha)
The string I ended up writing after the hint provided in the comments is:
\$((?:\d+(?:[,\.])*)+).*?\$((?:\d+(?:[,\.])*)+)
The non-capturing groups are what I hadn't understood before. I thought a non-capturing group would somehow remove the parts it matched from the enclosing group, but what it really means is that the parentheses group characters without counting as a capture group (nothing is removed from the enclosing group).
I appreciate the feedback I got on this post!
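The difference between a capturing and a non-capturing inner group can be seen with the standard re module. This is a minimal sketch on the example string from the question; the two patterns are simplified illustrations, not the exact expressions quoted above.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."

# Inner group is capturing: it shows up as a second group holding
# only the last repetition it matched.
with_inner_capture = re.search(r"\$\s*((\d+[,.]?)+)", text).groups()

# Inner group is non-capturing: it still groups for the + quantifier,
# but only the outer group is reported.
without_inner_capture = re.search(r"\$\s*((?:\d+[,.]?)+)", text).groups()

print(with_inner_capture)
print(without_inner_capture)
```

In both cases the outer group contains the same text; the only change is whether the inner parentheses produce an extra column.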
I am not sure if the text stays the same across all of the values but you can use the following regex:
r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.'
returning two matching groups:
15,732
1,520.62
You can also abstract the text:
r'\w+:\s*\$\s?([\d,.]+),(\s*\w+)+:\s*\$\s?([\d,.]+)\.'
with the same result.
You can use
df[0].str.extract(r"(\$\s*\d+(?:[,.]\d+)*)") # To get the first value
df[0].str.extractall(r"(\$\s*\d+(?:[,.]\d+)*)") # To get all values
df[0].str.findall(r"\$\s*\d+(?:[,.]\d+)*") # To get all values
The str.extract pattern is wrapped in a capturing group because the method requires at least one capturing group in the regex pattern in order to return a value.
The regex matches
\$ - a $ char
\s* - zero or more whitespaces
\d+ - one or more digits
(?:[,.]\d+)* - a non-capturing group matching zero or more repetitions of a comma/dot and then one or more digits.
See the regex demo.
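The pattern described above can be tried outside pandas with re.findall. The sketch below also converts the matches to numbers, assuming that a comma is a thousands separator and a dot is the decimal point; to_number is a hypothetical helper, not part of any library.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."

# A $ sign, optional whitespace, digits, then any number of
# "separator followed by digits" chunks.
MONEY = re.compile(r"\$\s*\d+(?:[,.]\d+)*")

raw_amounts = MONEY.findall(text)


def to_number(amount: str) -> float:
    """Hypothetical helper: drop the $ sign, whitespace and thousands commas."""
    return float(amount.lstrip("$").strip().replace(",", ""))


amounts = [to_number(a) for a in raw_amounts]

print(raw_amounts)
print(amounts)
```

Because every separator must be followed by digits, the trailing comma and the sentence-ending period are left out of the matches.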

Python regex pattern matching with ranges and whitespaces

I am attempting to match strings that would have a pattern of:
two uppercase Latin letters
two digits
two uppercase Latin letters
four digits
ex: MH 45 LE 4098
There can be optional whitespace after each of the first three groups, and each group must be limited to the number of characters listed. I was trying to group them and set a limit on the characters, but I am not matching any strings that fall within the defined parameters. I had attempted building a set like so: template = '[A-Z{2}0-9{2,4}]', but was still getting matches when the last digits exceeded 4.
template = '(A-Z{2})\s?(\d{2})\s?(A-Z{2})\s?(\d{4})'
This was the other attempt when I tried being more verbose, but then couldn't match anything.
You are close; you need to put square brackets around A-Z so that {2} applies to the whole range instead of only to Z. As it stands, A-Z{2} literally matches A-ZZ.
So
template = r"([A-Z]{2})\s?(\d{2})\s?([A-Z]{2})\s?(\d{4})"
should do. The square brackets define a character class (here a range of letters), whereas parentheses only group. With (A-Z){2}, the regex would try to match A-ZA-Z, i.e. the literal text A-Z two times.
You can see a demo here and you can change them to ( or omit them to see the effect in the demo.
This is probably the regex you are looking for:
[A-Z]{2}\s?[0-9]{2}\s?[A-Z]{2}\s?[0-9]{4}
Note that each \s? allows at most one whitespace character between groups; replace it with \s* to allow several.
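A quick way to check the length limits is re.fullmatch, which only succeeds when the whole string fits the pattern. The sketch below assumes a capturing version of the pattern and a few made-up candidate strings based on the example in the question.

```python
import re

# Four groups, each followed by at most one optional whitespace character.
PLATE = re.compile(r"([A-Z]{2})\s?(\d{2})\s?([A-Z]{2})\s?(\d{4})")

candidates = [
    "MH 45 LE 4098",   # example from the question
    "MH45LE4098",      # no whitespace at all
    "MH 45 LE 40981",  # five trailing digits
    "MH  45 LE 4098",  # two spaces after the first group
]
results = {c: bool(PLATE.fullmatch(c)) for c in candidates}

parts = PLATE.fullmatch("MH 45 LE 4098").groups()

# Without brackets, {2} applies only to Z, so the text "A-ZZ" is what matches.
literal_match = re.fullmatch(r"A-Z{2}", "A-ZZ") is not None

print(results)
print(parts)
print(literal_match)
```

With re.search or re.match instead of re.fullmatch, a string with extra trailing digits would still match on its first four digits, so anchoring matters when the length limit must be enforced.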

extract string between two strings in pandas

I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?
Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture one or more characters other than / after http:// (the extract method requires a regular expression with at least one capture group). It can be done with
http://([^/]+)/landing
       ^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.
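The patterns from both answers can be compared against the original attempt with the standard re module. This sketch uses the URL exactly as shown in the question, including its literal trailing dots; the variable names are illustrative.

```python
import re

url = "http://start.blabla.com/landing/fb603?&mkw..."

# The attempt from the question: "/*?" repeats the slash, not "any character".
broken = re.search(r"http://*?\/landing", url)

# One unnamed group capturing everything up to the next slash.
site = re.search(r"http://([^/]+)/landing", url).group(1)

# Named groups; in pandas these names would become the column names.
named = re.search(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)", url).groupdict()

print(broken)
print(site)
print(named)
```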

Regular Expression for matching a string with different combinations

I'm trying to match a string with the following different combinations using Python
(here the x's are digits; each xxxx is 4 digits long)
W|MON-FRI|xxxx-xxxx
W|mon-fri|xxxx-xxxx
W|MON-THU,SAT|xxxx-xxxx
W|mon-thu,sat|xxxx-xxxx
W|MON|xxxx-xxxx
Here the first and last parts are static; the second part can be any of the combinations shown above, with the days sometimes separated by ',' and sometimes by '-'.
I'm new to regular expressions. After reading up on how they work, I was able to write expressions for parts of the strings above, such as matching the last part with re.compile('(\d{4})-(\d{4})$') and the first part with re.compile('[w|W]').
I tried to match the second part but did not succeed with
new_patt = re.compile('(([a-zA-Z]{3}))([,-]?)(([a-zA-Z]{3})?)')
How can I achieve this?
Here is a regular expression that should work:
pat = re.compile(r'^W\|(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?(,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?\|\d{4}-\d{4}$', re.IGNORECASE)
Note first how you can ignore case to handle both lower and upper case. In addition to the static text at the beginning and the numbers at the end, this regex matches a day of the week, followed by an optional dash plus day of the week, followed by an optional sequence consisting of a comma and the previous sequence.
"^W\|(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?(,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?\|\d{4}-\d{4}$"i
^ assert position at start of the string
W matches the character W literally (case insensitive)
\| matches the character | literally
1st Capturing group (mon|tue|wed|thu|fri|sat|sun)
2nd Capturing group (-(mon|tue|wed|thu|fri|sat|sun))?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
- matches the character - literally
3rd Capturing group (mon|tue|wed|thu|fri|sat|sun)
4th Capturing group (,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
, matches the character , literally
5th Capturing group (mon|tue|wed|thu|fri|sat|sun)
6th Capturing group (-(mon|tue|wed|thu|fri|sat|sun))?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
- matches the character - literally
7th Capturing group (mon|tue|wed|thu|fri|sat|sun)
\| matches the character | literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
- matches the character - literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
$ assert position at end of the string
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
https://regex101.com/r/dW4dQ7/1
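The long pattern broken down above repeats the same day alternation several times. One possible way to keep it readable is to assemble it from smaller pieces. The sketch below uses non-capturing groups, since only a yes/no match is needed here, and the times 0900-1700 are made-up placeholders for the xxxx-xxxx part.

```python
import re

DAY = r"(?:mon|tue|wed|thu|fri|sat|sun)"
# A single day, optionally followed by a dash and a second day.
DAY_RANGE = rf"{DAY}(?:-{DAY})?"
# W| + day range + optional ",day range" + | + four digits - four digits.
PATTERN = re.compile(
    rf"^W\|{DAY_RANGE}(?:,{DAY_RANGE})?\|\d{{4}}-\d{{4}}$",
    re.IGNORECASE,
)

samples = [
    "W|MON-FRI|0900-1700",
    "W|mon-fri|0900-1700",
    "W|MON-THU,SAT|0900-1700",
    "W|mon-thu,sat|0900-1700",
    "W|MON|0900-1700",
]
all_valid = [bool(PATTERN.match(s)) for s in samples]

# Strings that should be rejected: three-digit time, unknown day name.
rejected = [
    bool(PATTERN.match("W|MON-FRI|900-1700")),
    bool(PATTERN.match("W|FOO|0900-1700")),
]

print(all_valid)
print(rejected)
```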
You can get everything in one go:
^W\|(?:\w{3}[-,]){0,2}\w{3}\|(?:\d{4}[-]?){2}$
With Live Demo
Thanks for your posts and comments.
At last I was able to satisfy my requirement with regular expressions.
Here it is:
"^[w|W]\|(mon|sun|fri|thu|sat|wed|tue|[0-6])(-(mon|fri|sat|sun|wed|thu|tue|[0-6]))?(,(mon|fri|sat|sun|wed|thu|tue|[0-6]))*?\|(\d{4}-\d{4})$"img
I just tweaked the answer posted by Julien Spronck
Once again thanks all

Regex with repeating groups

I've been trying to match a phrase between hyphens. I realise that I can easily just split on the hyphen and get out the phrases but my equivalent regex for this is not working as expected and I want to understand why:
([^-,]+(?:(?: - )|$))+
[^-,]+ is just my definition of a phrase
(?: - ) is just the non capturing space delimited hyphen
so (?:(?: - )|$) matches either a hyphen or the end of the line
Finally, the whole thing surrounded in parentheses with a + quantifier matches more than one.
What I get if I perform regex.match("A - B - C").groups() is ('C',)
I've also tried the much simpler regex ([^,-]+)+ with similar results
I'm using re.match because I wanted to use pandas.Series.str.extract to apply this to a very long list.
To reiterate: I'm now using an easy split on a hyphen but why isn't this regex returning multiple groups?
Thanks
Regular expression capturing groups are “named” statically by their appearance in the expression. Each capturing group gets its own number, and matches are assigned to that group regardless of how often a single group captures something.
If a group captured something before and later captures again, the later result overwrites what was captured before. There is no way to collect all of a group’s captured values with a single ordinary match.
If you want to find multiple values, you will need to match only a single group and repeat matching on the remainder of the string. This is commonly done by re.findall or re.finditer:
>>> re.findall('\s*([^-,]+?)\s*', 'A - B - C')
['A', 'B', 'C']
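Both behaviours can be reproduced side by side. The sketch below first shows the repeated group keeping only its last capture, then shows that the lazy findall pattern above splits longer phrases into single characters. The last pattern is one possible alternative for multi-character phrases, not the answer's own expression, and the fruit string is a made-up example.

```python
import re

text = "A - B - C"

# The repeated group from the question: only the last iteration survives.
last_only = re.match(r"([^-,]+(?:(?: - )|$))+", text).groups()

# The lazy pattern works for one-character phrases but splits longer ones.
split_chars = re.findall(r"\s*([^-,]+?)\s*", "AB - C")

# Alternative: a phrase starts and ends with a character that is not
# a hyphen, comma or whitespace, with anything but hyphen/comma between.
PHRASE = re.compile(r"[^-,\s](?:[^-,]*[^-,\s])?")
letters = PHRASE.findall(text)
phrases = PHRASE.findall("red apple - green pear - plum")

print(last_only)
print(split_chars)
print(letters)
print(phrases)
```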
