How to extract data from string

How to extract data from string - python

My code is
import regex
word = '\x02|1280|SELECT|35;36|="214554"'.encode('ascii')
pattern = r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}+|;*)\|="(\w+)"'.encode('ascii')
print(regex.match(pattern, word).group(4))
and I'm interested in group 4 -> (\d{1,2}+|;*), which can have one of the following forms:
|one digit number|
|two-digit number|
|one/two-digit number; one/two-digit number; ... ; one/two-digit number|
I have tried different combinations, but as I'm new to regex, none of them returns the data from the group.

Try changing the pattern for group 4 to (\d{1,2}(?:;\d{1,2})*), where:
\d{1,2} represents one or two digits
(?:;\d{1,2})* represents zero or more repetitions of a non-capturing group that matches a semicolon ; followed by one or two digits
Mark the inner group as non-capturing by adding ?: after its opening parenthesis, so that it does not create an extra capture group and shift the numbering of the groups that follow
Hope this helps!

The \d{1,2}+|;* pattern matches either 1 or 2 digits (possessively) or 0+ semicolons, but never digits and semicolons together, so a value like 35;36 cannot match. So, it is not what you need.
You need to write the pattern like this:
r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
The Group 4 pattern will look like (\d{1,2}(?:;\d{1,2})*):
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing group that matches sequences of....
; - a semi-colon
\d{1,2} - 1 or 2 digits
)* - .... zero or more occurrences.
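As a quick check, the corrected pattern can be run against the sample from the question. This is only a minimal sketch: it uses the standard library re module instead of regex (an assumption; the corrected pattern has no possessive quantifier, so either module should accept it), and the helper name extract_ids is made up for illustration.

```python
import re

# Corrected pattern from the answer above, written as a bytes pattern
# because the question encodes its input to ASCII bytes.
PATTERN = re.compile(
    rb'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
)


def extract_ids(message: bytes) -> list[int]:
    """Return the numbers found in group 4 as ints (hypothetical helper)."""
    match = PATTERN.match(message)
    if match is None:
        raise ValueError("message does not match the expected layout")
    # Group 4 looks like b'35;36', so split on the semicolon.
    return [int(part) for part in match.group(4).split(b";")]


word = '\x02|1280|SELECT|35;36|="214554"'.encode('ascii')
print(extract_ids(word))  # [35, 36]
```

Checking for None before calling group() avoids the AttributeError that the original snippet raises when nothing matches.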

Related

Regex to match phone number 5ABXXYYZZ

I am using regex to match 9 digits phone numbers.
I have this pattern 5ABXXYYZZ that I want to match.
What I tried
I have this regex that matches two repetitions only 5ABCDYYZZ
import re
S_P_2 = 541982277
S_P_2_pattern = re.sub(r"(?!.*(\d)\1(\d)\2(\d)\3).{4}\d(\d)\4(\d)\5", "Special", str(S_P_2))
print(S_P_2_pattern)
What I want to achieve
I would like to update it to match three repetitions, 5ABXXYYZZ, for example 541882277

Try:
^5\d\d(?:(\d)\1(?!.*\1)){3}$
^5\d\d - Start-line anchor and a literal 5, followed by any two digits;
(?:(\d)\1(?!.*\1)){3} - Non-capturing group matched three times; each time a nested capture group matches a digit that is followed directly by itself but (due to the negative lookahead) does not occur again after 0+ further chars;
$ - End-line anchor.
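A short way to try this pattern from Python is sketched below. The helper name is_special and the third test number are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: 5, any two digits, then three doubled digits,
# where a doubled digit may not occur again later in the number.
SPECIAL = re.compile(r"^5\d\d(?:(\d)\1(?!.*\1)){3}$")


def is_special(number: int) -> bool:
    """Check a number against the 5ABXXYYZZ shape (hypothetical helper)."""
    return SPECIAL.match(str(number)) is not None


for candidate in (541882277, 541982277, 541888877):
    print(candidate, is_special(candidate))
# 541882277 True
# 541982277 False  (98 is not a doubled digit)
# 541888877 False  (the pair 88 occurs again later)
```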

Regex for Alternating Numbers

I am trying to write a regex pattern for phone numbers consisting of 9 fixed digits.
I want to identify numbers in which two digits alternate four times, such as 5XYXYXYXY
I used the below sample
number = 561616161
I tried the below pattern but it is not accurate
^5(\d)(?=\d\1).+
Can someone point out what I am doing wrong?

I would use:
^(?=\d{9}$)\d*(\d)(\d)(?:\1\2){3}\d*$
Here is an explanation of the pattern:
^ from the start of the number
(?=\d{9}$) assert exactly 9 digits
\d* match optional leading digits
(\d) capture a digit in \1
(\d) capture another digit in \2
(?:\1\2){3} match the XY combination 3 more times
\d* more optional digits
$ end of the number
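The breakdown above can be tried out with a few lines of Python. This is a minimal sketch; the helper name alternating_pair is an assumption introduced for the example.

```python
import re

# Pattern from the answer: exactly 9 digits containing X, Y and then
# XY three more times, with optional digits before and after.
ALTERNATING = re.compile(r"^(?=\d{9}$)\d*(\d)(\d)(?:\1\2){3}\d*$")


def alternating_pair(number):
    """Return the alternating digits (X, Y) as strings, or None (hypothetical helper)."""
    match = ALTERNATING.match(str(number))
    return match.groups() if match else None


print(alternating_pair(561616161))  # ('6', '1')
print(alternating_pair(561616162))  # None
```

Note that the captured pair here is 6 and 1: the leading 5 is consumed by the first \d*, so the pattern does not require the number to start with 5.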

If you want to match a pair of digits repeated 4 times, with 9 digits in total:
^(?=\d{9}$)\d*(\d\d)\1{3}\d*$
Explanation
^ Start of string
(?=\d{9}$) Positive lookahead, assert 9 digits till the end of the string
\d* Match optional digits
(\d\d)\1{3} Capture group 1, match 2 digits and then repeat what is captured in group 1 3 times
\d* Match optional digits
$ End of string
If you want to match a pair of digits repeated 4 times, where the 2 digits in the pair are not the same:
^(?=\d{9}$)\d*((\d)(?!\2)\d)\1{3}\d*$
Explanation
^ Start of string
(?=\d{9}$) Positive lookahead, assert 9 digits till the end of the string
\d* Match optional digits
( Capture group 1
(\d) Capture group 2, match a single digit
(?!\2)\d Negative lookahead, assert not the same char as captured in group 2. If that is the case, then match a single digit
) Close group 1
\1{3} Repeat the captured value of capture group 1 3 times
\d* Match optional digits
$ End of string
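The difference between the two patterns in this answer shows up on a number made of identical digits. The comparison below is a small sketch; the constant names and the sample numbers other than 561616161 are chosen for illustration.

```python
import re

# Any two-digit pair repeated 4 times within 9 digits.
ANY_PAIR = re.compile(r"^(?=\d{9}$)\d*(\d\d)\1{3}\d*$")
# Same, but the two digits of the pair must differ.
DISTINCT_PAIR = re.compile(r"^(?=\d{9}$)\d*((\d)(?!\2)\d)\1{3}\d*$")

for number in ("561616161", "511111111", "561616162"):
    print(number, bool(ANY_PAIR.match(number)), bool(DISTINCT_PAIR.match(number)))
# 561616161 True True
# 511111111 True False
# 561616162 False False
```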

Judging from the OP's own attempt, ^5(\d)(?=\d\1).+, and without adding any assumptions of my own, my first guess was that a regex is needed to verify numbers that start with 5, followed by 4 repetitions of the same two-digit pair.
^5(\d\d)\1{3}$
The same idea with the added guess that numbers made of identical digits, e.g. 511111111, should be disallowed:
^5((\d)(?!\2)\d)\1{3}$
Guessing further that the 5 is a variable value, and assuming one variable digit at either the start or the end, with the idea of rejecting invalid values early (having already seen the other nice answers):
^(?=\d?(\d\d)\1{3})\d{9}$
Solution 3 combined with solution 2's assumption that the two digits in the pair differ:
^(?=\d?((\d)(?!\2)\d)\1{3})\d{9}$
Solutions 3 and 4 are the most obvious variations on #4thBird's nice answer, with the order changed.
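To see how the fixed-5 variant and the lookahead variant differ, both can be run against numbers where the extra digit sits at the start or at the end. This is a sketch only; the constant names and the sample numbers other than 561616161 and 511111111 are assumptions for the example.

```python
import re

# Solution 2: literal 5, then a pair of two different digits, 4 times.
STARTS_WITH_5 = re.compile(r"^5((\d)(?!\2)\d)\1{3}$")
# Solution 4: the extra digit may be any digit, at the start or the end.
EITHER_END = re.compile(r"^(?=\d?((\d)(?!\2)\d)\1{3})\d{9}$")

for number in ("561616161", "961616161", "616161615", "511111111"):
    print(number, bool(STARTS_WITH_5.match(number)), bool(EITHER_END.match(number)))
# 561616161 True True
# 961616161 False True
# 616161615 False True
# 511111111 False False
```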

How to exclude two words from regex?

I have this regex:
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)
That should match
] AN 1 words 2 words 3 words
or
] AV 1 words 2 words 3 words
The words after 3 should exclude "da cui", so "da\scui", but it doesn't work. Try it here: https://regex101.com/r/kI7Tan/1
What am I doing wrong?
Sample string:
campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo con penna C as a 1 cfr Nota filologica
Expected output: it won't match it because of the "da cui". So basically I want to match all words without the string "da cui".

The final capture group of the regex ( ([\w\s][^cui]+) ) matches ...
Exactly 1 word character due to the first character class.
This class does not match a whitespace due to the preceding \s* in the regex.
One or more characters other than c, u, i (a character class excludes single characters, not the word cui).
If you want to exclude matches contingent on the word(s) da cui, use a negative lookahead.
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)
Update
Capture group reintroduced to the regex.
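A minimal sketch of the negative-lookahead version in Python follows. The first line is the sample from the question; the second line, without "da cui", is made up for the example.

```python
import re

# Negative lookahead after "3": refuse the match if "da cui" appears
# anywhere in the rest of the line.
PATTERN = re.compile(
    r"\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)"
)

with_da_cui = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 "
               "da cui campo con penna C as a 1 cfr Nota filologica")
without_da_cui = "campo] AN 1 campo 2 prato 3 campo con penna C"  # made-up line

print(PATTERN.search(with_da_cui))              # None
print(PATTERN.search(without_da_cui).group(4))  # campo con penna C
```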

You may use either of the two:
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*(.*?)(?=cui|$)
The (?:(?!cui).)* is a tempered greedy token that matches any char, 0 or more occurrences, as many as possible, that does not start a cui char sequence. The (.*?)(?=cui|$) pattern captures 0+ chars other than line break chars, as few as possible, up to the cui char sequence or end of string.
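Both variants can be compared on the sample line from the question. As this sketch suggests, they do not reject the line; they still match and simply stop the last group before cui. The names HEAD, TEMPERED and LAZY are chosen for the example.

```python
import re

# Shared prefix of both patterns from the answer.
HEAD = r"\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*"
TEMPERED = re.compile(HEAD + r"((?:(?!cui).)*)")  # tempered greedy token
LAZY = re.compile(HEAD + r"(.*?)(?=cui|$)")       # lazy match up to cui or end

line = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 "
        "da cui campo con penna C as a 1 cfr Nota filologica")

for pattern in (TEMPERED, LAZY):
    print(repr(pattern.search(line).group(4)))  # 'da ' in both cases
```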

My interpretation of the question concerns the string that follows the one or more spaces after 3 and runs to the end of the line: if that string contains da cui, an empty string is to be saved to capture group 4; otherwise the string itself is saved to capture group 4.
You could use the following regular expression.
\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*)
This replaces 3\s*([\w\s][^cui]+) in the OP's regex with 3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*).
Python's regex engine performs the following steps after matching 3.
\s+ match 1+ spaces
( begin capture group 4
(?=.*\bda cui\b) positive lookahead: assert that 0+ chars followed by 'da cui' come next
| or
(?!.*\bda cui\b) negative lookahead: assert that 0+ chars followed by 'da cui' do not come next
.* match rest of line
) end capture group 4
If the positive lookahead succeeds an empty string is saved to the capture group.
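The behaviour described above can be checked with the sketch below. It writes the negative lookahead as (?!...), and the second test line (without "da cui") is made up for the example.

```python
import re

# Group 4 is empty when "da cui" follows, otherwise it holds the rest
# of the line.
PATTERN = re.compile(
    r"\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+"
    r"((?=.*\bda cui\b)|(?!.*\bda cui\b).*)"
)

with_da_cui = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 "
               "da cui campo con penna C as a 1 cfr Nota filologica")
without_da_cui = "campo] AN 1 campo 2 prato 3 campo con penna C"  # made-up line

print(repr(PATTERN.search(with_da_cui).group(4)))     # ''
print(repr(PATTERN.search(without_da_cui).group(4)))  # 'campo con penna C'
```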

Match capture group fixed number of times

I have a bunch of 5-letter strings. For each string, I would like to match only if the string contains 3 instances of the same letter, i.e.:
Case 1: 'aabbc' -> no match
Case 2: 'bbbcc' -> match 'bbb'
Case 3: 'ddcdc' -> match 'ddd'
My best regex attempt is:
(.){1}(?!\1)*\1{1}(?!\1)*\1{1}
This works for case 1 (where there is no match) and case 2 (where the 3 instances are adjacent), but not for case 3 (where the 3 instances are separated by at least one other letter).
Is there a regex that will work for case 3? Ideally I would like to also extract the locations of the 3 matching instances from the string.

The below pattern catches what you need and should capture the edge cases. The first capturing group can be modified to just be the subset of characters you need to search for if there is a limited list of expected values. Putting the \1s in capturing groups means that you should be able to extract the index of the capturing groups from the match via .start() (getting the starting index of the capturing group), meeting your bonus goal.
>>> import re
>>> pattern = r"(.).*(\1).*(\1)"
>>> x = re.search(pattern, "ababb")
>>> x.groups()
('b', 'b', 'b')
>>> x.start(1)
1
>>> x.start(2)
3
>>> x.start(3)
4
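Applied to the three cases from the question, the same idea might look like the sketch below. The helper name triple_positions is an assumption introduced for the example.

```python
import re

# Same pattern as in the session above: a character, then the same
# character two more times, each backreference in its own group.
TRIPLE = re.compile(r"(.).*(\1).*(\1)")


def triple_positions(text):
    """Return (char, [indexes]) for a character seen three times, or None."""
    match = TRIPLE.search(text)
    if match is None:
        return None
    return match.group(1), [match.start(i) for i in (1, 2, 3)]


for word in ("aabbc", "bbbcc", "ddcdc"):
    print(word, triple_positions(word))
# aabbc None
# bbbcc ('b', [0, 1, 2])
# ddcdc ('d', [0, 1, 3])
```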

I think the pattern ([a-z]).*?\1.*?\1 does what you want, although there are likely to be edge cases that would complicate it.
The pattern looks for a lowercase letter three times, with 0 or more characters between them.
You could then extract just the capturing groups to get your match locations.
At the moment, the pattern only looks for any lowercase letter repeated three times, but you could change the initial capturing group - ([a-z]) - if you wanted to capture something else.

You can use the following regex to determine if one character appears at least three times.
^.*(.).*\1.*\1
This does not check that the characters are letters but it does work with any characters. To restrict to letters change each . to [a-z] or [a-zA-Z], as appropriate.
To see if one character appears exactly 3 times, change the regex to:
^(?!.*(.)(?:.*\1){3,}).*(.).*\2.*\2
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
(.) # match a char in cap grp 1
(?:.*\1) # match 0+ chars followed by content of cap grp 1
# in a non-cap grp
{3,} # execute non-cap grp 3+ times
) # end negative lookahead
.* # match 0+ chars
(.) # match char in cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2
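A small sketch for trying the second regex on 5-letter strings follows. The sample "ddddc" is made up for the example. Note that the lookahead rejects the whole string as soon as any character occurs four or more times.

```python
import re

# Negative lookahead: no character may occur 4+ times.
# Main part: some character (group 2) occurs at least 3 times.
EXACTLY_THREE = re.compile(r"^(?!.*(.)(?:.*\1){3,}).*(.).*\2.*\2")

for word in ("aabbc", "ddcdc", "ddddc"):
    match = EXACTLY_THREE.search(word)
    print(word, match.group(2) if match else None)
# aabbc None
# ddcdc d
# ddddc None
```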

Greedy Python RegEx capturing group to include "and"

I need some help writing regex expressions. I need an expression that can match the following patterns (including words and digits, spaces and commas):
Line 145
Line3544354
Lines 10,12
Line items 45,10,26
Lines 10 and 45
Thus far, I wrote one expression which includes the first three patterns and all case variations:
r'(?i)(line item[\.*\,*\s*\d+]+]+|line[\.*\,*\s*\d+]+|lines[\.*\,*\s*\d+]+|line items[\.*\,*\s*\d+]+)'
I would like to include the last two patterns listed, but I am not sure how. I wrote this expression for the pattern matching "Lines 10 and 45" by modifying the capturing group as follows:
r'(Lines[\.*\,*\w*\s*\d+]+)'
However, it does not work as expected. It selects all word characters in the string. I would like to keep my expressions greedy, but not sure how to implement the last two patterns in the list.
Any suggestions please?

You may use
(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*
Pattern details:
(?i) - ignore case inline modifier
lines? - line or lines (? quantifier makes the preceding pattern optional, matching 1 or 0 occurrences)
(?:\s+items?)? - an optional non-capturing group matching 1 or 0 occurrences of 1+ whitespaces followed with item and an optional s char
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)* - 0 or more repetitions of
\s* - 0+ whitespaces
(?:,|and) - , or and char sequence
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
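The pattern can be checked against the five samples from the question with a short sketch like the one below. The sentence used with findall is made up for the example.

```python
import re

# Pattern from the answer; (?i) at the very start makes it case-insensitive.
LINE_REF = re.compile(
    r"(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*"
)

samples = ["Line 145", "Line3544354", "Lines 10,12",
           "Line items 45,10,26", "Lines 10 and 45"]
for sample in samples:
    # fullmatch succeeds only if the whole sample is covered by the pattern.
    print(sample, "->", LINE_REF.fullmatch(sample) is not None)

text = "Please check lines 10 and 45 before LINE ITEMS 45,10,26 are shipped."
print(LINE_REF.findall(text))  # ['lines 10 and 45', 'LINE ITEMS 45,10,26']
```

Because the pattern has no capturing groups, findall returns the full text of each match.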
