regex - Extract complete word based on match string [duplicate] - python

Can someone please help me with this? I'm trying to extract the word from a given sentence that contains one of the units G, GM, KG, L, ML, PCS along with a number.
I am able to match the string, but I'm not sure how to extract the complete word.
For example, if my input is "This packet contains 250G Dates", the output should be 250G.
Another example: for "You paid for 2KG Apples", the output should be 2KG.
With my regular expression I get only the matched string, not the complete word :(
import re
val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G','GM','KG','L','ML','PCS']
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)

This regex will not get you what you want:
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
Let's break it down:
\d+: one or more digits
\.?: a dot (optional, as indicated by the question mark)
\d*: zero or more digits
(\s|G|KG|GM|L|ML|PCS): a group of alternatives, but whitespace is listed as one of the alternatives. It should be outside the group: what you probably want is to allow optional whitespace between the number and the unit, i.e. 240G or 240 G
\s?: optional whitespace
A better expression for your purpose could be:
re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
That means: one or more digits, followed by optional whitespace and then any one of these units: G|KG|GM|L|ML|PCS.
Note the ?: that marks the group as non-capturing. Without it, findall would return only the captured unit, G.
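As a minimal sketch of how this could be extended to cover the decimal numbers the original pattern tried to allow, the version below adds an optional fractional part and word boundaries. The names UNIT_PATTERN and extract_quantities are made up for illustration, and the third sample sentence is an invented input.

```python
import re

# Assumption: a quantity is digits, an optional decimal part, optional
# whitespace, and one of the units from the question.
# The trailing \b keeps a unit from matching as the prefix of a longer
# word (for example "2 GREEN" is not read as "2 G").
UNIT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:KG|GM|G|ML|L|PCS)\b")

def extract_quantities(text):
    """Return every number-plus-unit token found in text."""
    return UNIT_PATTERN.findall(text)

print(extract_quantities("FUJI ALUMN FOIL CAKE, 240G, CHCLTE"))  # ['240G']
print(extract_quantities("You paid for 2KG Apples"))             # ['2KG']
print(extract_quantities("Mix 1.5 L water with 250GM flour"))    # ['1.5 L', '250GM']
```

Because the only group is non-capturing, findall returns the whole match for each hit.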

Try using this Regex:
\d+\s*(G|KG|GM|L|ML|PCS)\s?
It matches any substring that starts with at least one digit followed by one of the units. Whitespace is allowed between the digits and the unit, and after the unit.
Adjust this as you like :)

Use non-capturing parentheses (?:...) instead of the normal ones. Without capturing parentheses, findall returns the string(s) that match the whole pattern.
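The difference can be seen side by side in a short sketch using the sample string from the question. The variable names are illustrative only.

```python
import re

val = "FUJI ALUMN FOIL CAKE, 240G, CHCLTE"

# Capturing group: findall returns only what the group captured.
captured_only = re.findall(r"\d+\s*(G|KG|GM|L|ML|PCS)", val)
print(captured_only)  # ['G']

# Non-capturing group: findall returns the whole match.
whole_match = re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
print(whole_match)  # ['240G']

# finditer gives access to the whole match even when groups are present.
via_finditer = [m.group(0) for m in re.finditer(r"\d+\s*(G|KG|GM|L|ML|PCS)", val)]
print(via_finditer)  # ['240G']
```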

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I'm having trouble explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates and then removing them with re.sub seems to me the approach to take, but I can't figure out the regex I need.
The only attempt I feel is even worth posting is this: (\w)\1{3,0}
It matched only the first run of a character repeated more than three times: a single match covering the whole block of repeated characters, not just the ones exceeding the maximum of two. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
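If the limit should be configurable, the same idea can be wrapped in a small helper. This is only a sketch: squeeze_repeats and its max_repeats parameter are hypothetical names, not part of the answer above.

```python
import re

def squeeze_repeats(text, max_repeats=2):
    """Collapse runs of one word character to at most max_repeats copies."""
    # (\w) captures a character; \1{n,} requires at least n more copies,
    # so the pattern only fires on runs longer than max_repeats.
    pattern = r"(\w)\1{%d,}" % max_repeats
    return re.sub(pattern, lambda m: m.group(1) * max_repeats, text)

print(squeeze_repeats("aaaaaaaaare yooooooooou okkkkkk"))  # aare yoou okk
```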
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
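The breakdown above maps directly onto Python's re.VERBOSE flag, which lets the comments live inside the pattern. A minimal sketch, with excess_repeats as an illustrative name:

```python
import re

excess_repeats = re.compile(r"""
    (\w)        # one word character, saved as group 1
    \1*         # zero or more further copies of it
    (?=\1\1)    # lookahead: two more copies must follow (not consumed)
""", re.VERBOSE)

result = excess_repeats.sub("", "aaaaaaaaare yooooooooou okkkkkk")
print(result)  # aare yoou okk
```

Only the surplus characters are matched and deleted, so the replacement string is empty and the last two copies of each run are left in place.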

Regex for string that has 5 numbers or IND/5numbers

I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the following regex, which selects the 5 digits correctly using word boundaries. But I'm not sure how to allow IND and disallow anything else, such as MRF:
\b\d{5}\b
If I put (IND)? at the beginning of the regex, it doesn't help. Any hints?
Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
Because the lookbehind doesn’t consume any input, the entire match is your target number (i.e. there’s no need to use a group).
Variable-length lookbehind is not supported by Python's re module; use alternation instead:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
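A quick way to check this pattern is to run it over the example inputs from the question. The sketch below assumes the expected results listed there; the samples dictionary and find_code helper are illustrative names.

```python
import re

# Both lookbehind branches are exactly four characters wide, which is
# what lets Python's re module accept the alternation.
pattern = re.compile(r"(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)")

def find_code(text):
    """Return the 5-digit number if the text qualifies, else None."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Inputs and expected results taken from the question.
samples = {
    "10223": "10223",
    "IND/10110": "10110",
    "ID is 11233": "11233",
    "Ref is:10223": "10223",
    "Ref is: th10223": None,
    "SBI12234": None,
    "MRF/10234": None,
    "RBI/10229": None,
}

for text, expected in samples.items():
    print(f"{text!r:20} -> {find_code(text)} (expected {expected})")
```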
This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .
Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number
I feel like this will need some logic around it. The regex can find the 5 digits, but you may need a second regex pattern to find IND, then combine the results if need be. I'm not sure whether you are using Python, .NET, or Java, but it should be doable.

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or, but I just want it to output a single group (to abstract the code from the regex).
I could also use a second regex step to capture the timestamp from the group that has the quotation marks, but again that would break the code abstraction.
I could omit the "" from the regex, but then it would match a timestamp in the middle of the message. I want it to capture the timestamp only as the value of a key, or at the beginning of the document followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (one I hadn't figured out), probably with the same solution!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a double quotation mark immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
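Applied to the two document types from the question, the pattern can be exercised as in the sketch below. The extract_timestamp helper and the third test message are made up for illustration.

```python
import re

# Lookarounds assert context without consuming it, so group(0) is the
# bare timestamp in both cases.
rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

def extract_timestamp(msg):
    """Return the 10-digit timestamp, or None if the message has none."""
    m = rx.search(msg)
    return m.group(0) if m else None

print(extract_timestamp('1545994641 INFO: ...'))                       # 1545994641
print(extract_timestamp('{"deliveryDate":"1545994641","error"..."}'))  # 1545994641
print(extract_timestamp('id 1545994641 in the middle'))                # None
```

The calling code only ever reads the whole match, so it does not need to know which alternative fired.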
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must also be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also drop the trailing space, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

How to only capture first group in regex? [duplicate]

This is the text I am referring to:
' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM '
Using Python and regex, I only want to capture: "High 4:55AM 1.3m Low 11:35AM 0.34" (which is the first part of the text, and ideally I'd like to capture it without the extra spaces).
I've tried this regex so far: .{44}
It manages to capture the group of text I want, which is the first 44 characters, but it also captures subsequent groups of 44 characters which I don't want.
If you really just want the first 44 characters, you don't need a regex: you can simply use the Python string-slice operator:
first_44_characters = s[:44]
However, a regex is much more powerful, and could account for the fact that the length of the section you're interested in might change. For example, if the time is 10AM instead of 4AM the length of that part might change (or might not, maybe that's what the space padding is for?). In that case, you can capture it with a regex like this:
>>> re.match(r'\s+(High.*?)m', s).group(1)
'High 4:55AM 1.3'
\s matches any whitespace character, + matches one or more of the preceding element, the parentheses define a group starting with High and containing a minimal-length sequence of any character, and the m after the parentheses says the group ends right before a lowercase m character.
If you want, you can also use the regex to extract the individual parts of the sequence:
>>> re.match(r'\s+(High)\s+(\d+\:\d+)(AM|PM)\s+(\d+\.\d+)m', s).groups()
('High', '4:55', 'AM', '1.3')
This regex will capture everything from the first "High" up to the next "High" (not included), or up to the end of the string if there is none. It also strips the extra spaces at the beginning and end of the captured group.
^\s*(High.*?)\s*(?=$|High)
If you want to reduce runs of multiple spaces inside the captured group to single spaces, you can afterwards replace the regex " +" with " ".
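Both steps can be sketched together as follows. The sample string is an assumption: it is a shortened version of the text from the question with extra spaces added by hand, since the original spacing is not visible here.

```python
import re

# Hypothetical input with irregular spacing, modelled on the question's text.
s = "   High   4:55AM  1.3m   Low  11:35AM  0.34m   High  5:47PM  1.12m   Low 11:40PM 0.47m  "

# Step 1: capture from the first "High" up to the next "High" (or the end).
m = re.search(r"^\s*(High.*?)\s*(?=$|High)", s)
first_block = m.group(1)

# Step 2: collapse runs of spaces inside the captured text.
tidy = re.sub(r" +", " ", first_block)
print(tidy)  # High 4:55AM 1.3m Low 11:35AM 0.34m
```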

regex conditional matching

I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
However, some cases have been shortened to 01-234-5 instead of 01-234-0005 when the last four digits are three zeros followed by a non-zero digit.
Since there doesn't seem to be any uniformity in formatting, I had to account for a few different separator characters, or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6-digit patterns and then check for a separator with Python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma; they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
# \d{9}: exactly nine digits (no separator)
# |: or
# [-:\s] without a trailing ?: sep not optional
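A short check of this pattern against the four cases discussed (full form, shortened form, no separator, and shortened with no separator) might look like the sketch below. The match_code helper and the test inputs are illustrative.

```python
import re

pattern = re.compile(r"\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b")

def match_code(text):
    """Return the matched code, or None if the text does not qualify."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(match_code("01-234-5678"))  # 01-234-5678
print(match_code("01-234-5"))     # 01-234-5
print(match_code("012345678"))    # 012345678
print(match_code("012345"))       # None (shortened form needs a separator)
```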
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the whole thing to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.
If there is no separator, a string such as "012340008" will still match the regex above, because [-:\s]? matches zero or one times.
HTH
