Extracting Data with Python Regular Expressions


I am having trouble coming up with a Python regular expression to extract specific values.
The page I am trying to parse has a number of productIds, which appear in the following format:
\"productId\":\"111111\"
I need to extract all the values, 111111 in this case.

import re
t = "\"productId\":\"111111\""
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))
This means: match non-word characters (\W*), then productId followed by non-colon characters ([^:]*) and a :. Then match non-digits (\D*) and capture the digits that follow ((\d+)).
Output
111111
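Note that re.match only matches at the start of the string. To collect every id from a whole page, a minimal sketch with re.findall could look like this; the page string is a made-up sample, not real data:

```python
import re

# Hypothetical page fragment with backslash-escaped quotes, as in the question.
page = r'{\"productId\":\"111111\",\"name\":\"a\"},{\"productId\":\"222222\"}'

# findall scans the whole string and returns the captured group of every match.
product_ids = re.findall(r'productId[^:]*:\D*(\d+)', page)
print(product_ids)
```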

Something like this:
In [13]: s = r'\"productId\":\"111111\"'
In [14]: print(s)
\"productId\":\"111111\"
In [15]: import re
In [16]: re.findall(r'\d+', s)
Out[16]: ['111111']

The backslashes here might add to the confusion, because they are used as an escape character both by (non-raw) Python strings and by the regexp syntax.
This extracts the product ids from the format you posted:
re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')
The raw string r'...' does away with one level of backslash escaping; the use of a single quote as the string delimiter does away with the need to escape double quotes; and finally the backslashes are doubled (only once) because of their special meaning in the regexp language.
You can use the regexp object's findall() method to find all matches in some text:
re_prodId.findall(text_to_search)
This will return a list of all product ids.
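To make the escaping levels concrete, here is a small sketch. The non-raw spelling and the sample text are assumptions added for illustration; only the raw pattern comes from the answer above.

```python
import re

# The same pattern written without and with the raw-string prefix.
# Without r'...', every backslash has to be doubled once more.
non_raw = '\\\\"productId\\\\":\\\\"([^"]+)\\\\"'
raw = r'\\"productId\\":\\"([^"]+)\\"'
assert non_raw == raw

re_prodId = re.compile(raw)

# Hypothetical sample text with a literal backslash before each quote.
text_to_search = r'\"productId\":\"111111\", \"productId\":\"222222\"'
product_ids = re_prodId.findall(text_to_search)
print(product_ids)
```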

Try this,
:\\"(\d*)\\"
Give more examples of your data if this doesn't do what you want.

Related

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.

You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
There are three main problems here:
1) The input string should be the same as used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and regex.
2) The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+.
3) Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?
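The first two points can be checked with a short sketch. The variable names plain and raw are made up for this illustration, and the shortened sample string is an assumption:

```python
import re

# In a normal string literal, \a is parsed as the BEL control character,
# so the text no longer contains a backslash at that position.
plain = "word\anyword"
raw = r"word\anyword"
print(len(plain), len(raw))

# With raw strings on both sides, a letters-only class that Python's re
# understands ([^\W\d_]+) stands in for the unsupported [[:alpha:]]+.
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([^\W\d_]+)\\\d+", s)
month = m.group(1) if m else None
print(month)
```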

Regular expression match / split

I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I (seem) to be able to validate this format by calling match on the following regular expression
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators than just '_'.)
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine .. the problem is that I would like my 'arbitrarystring' part to include '-' and '_' and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?

You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
          ^^^^^^^^
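As a rough illustration of reading the captured group, assuming a made-up file name whose middle part contains both '-' and '_':

```python
import re

# Pattern from the answer, split over two lines for readability.
pattern = re.compile(
    r'^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}'
    r'[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$'
)

# Hypothetical file name used only for this example.
m = pattern.match('12345_arbitrary-string_x_2020_05_20_10_10_10.dat')
arbitrary = m.group(1) if m else None
print(arbitrary)
```

The group is greedy, but the engine backtracks until the timestamp and the .dat suffix still match, so only the middle part ends up in group 1.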

It seems to be possible to cut down your regex to validate the whole pattern to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Quick reminder: you don't seem to have used raw strings, but escaped with a double backslash instead. Python has raw string literals (r'...'), so you don't have to escape backslashes any more.
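For something closer to what sscanf gives you, one possible sketch uses named groups on top of the shortened pattern. FILENAME_RE, parse_filename and the group names are hypothetical and not part of the answers above:

```python
import re
from datetime import datetime

# Hypothetical helper pattern: id, free-form name, then the timestamp.
FILENAME_RE = re.compile(
    r'^(?P<id>\d{5})_(?P<name>.+?)_'
    r'(?P<stamp>\d{4}[-_]\d{2}[-_]\d{2}[-_]\d{2}[:_]\d{2}[:_]\d{2})\.dat$'
)

def parse_filename(filename):
    """Return (id, name, timestamp), or None if the name does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    # Drop the separators so a single strptime format covers all variants.
    digits = re.sub(r'\D', '', m.group('stamp'))
    stamp = datetime.strptime(digits, '%Y%m%d%H%M%S')
    return int(m.group('id')), m.group('name'), stamp

print(parse_filename('12345_my-arbitrary_string_2020_05_20_10_10_10.dat'))
```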

Python regex number by looking behind

I am extracting numbers from strings in the following format:
AB1234
AC1234
AD1234
As you can see, A is always there and the second character is anything except ". I wrote the code below to extract the number.
re.search(r'(?<=A[^"])\d*',input)
But I encountered an error.
look-behind requires fixed-width pattern
So is there a convenient way to extract the numbers? For now I only know how to search twice to get them. Thanks in advance.
Note that A stands for a pattern; in fact, A is a word in a long string.

The regex in your example works, so I'm guessing your actual pattern has variable width character matches (*, +, etc). Unfortunately, regex look behinds do not support those. What I can suggest as an alternative, is to use a capture group and extract the matching string -
m = re.search(r'A\D+(\d+)', s)
if m:
    r = m.group(1)
Details
A      # your word
\D+    # anything that is not a digit
(      # capture group
  \d+  # 1 or more digits
)
If you want to take care of double quotes, you can make a slight modification to the regular expression by including a character class -
r'A[^\d"]+(\d+)'
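A minimal sketch of the capture-group approach on the sample inputs, plus a check that a variable-width look-behind is rejected. The variable-width pattern here is an assumed example, since the question does not show the real one:

```python
import re

samples = ['AB1234', 'AC1234', 'AD1234']

# Capture group instead of a look-behind: the prefix is matched but not returned.
numbers = []
for s in samples:
    m = re.search(r'A[^\d"]+(\d+)', s)
    if m:
        numbers.append(m.group(1))
print(numbers)

# A look-behind containing * has no fixed width, so re refuses to compile it.
try:
    re.compile(r'(?<=A[^"]*)\d+')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print(exc)
```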

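Try using this regex instead:
re.search(r'(?=A[^"]\d*)\d*',input)
Note that a lookahead does not consume the A or the character after it, so the trailing \d* matches an empty string at that position; the capture-group approach is the one that returns the digits.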

python regular expression not matching properly

I have a string
"aaabbbbccc"
I want to retrieve
["aaa", "bbbb", "ccc"]
According to this post
What regex can match sequences of the same character?
In [8]: re.findall('(\w)\1+', s)
Out[8]: []
I think I successfully retrieved this pattern using an online regex parser.

There are two things you should consider here:
1) Use raw string literals when defining regex (or double escape the \ inside the pattern so that \1 could be parsed as a backreference and not as an octal character notation), and
2) Use re.finditer here to get whole match values since re.findall will fetch only the values captured with capturing groups:
import re
s = 'aaabbbbccc'
print([x.group() for x in re.finditer(r'(\w)\1+', s)])
Here, x.group() is the whole match value of each match object (re.Match in Python 3) yielded by re.finditer.
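Both points can be seen side by side in a short sketch. The variant with an outer group is an additional option, not something from the answer itself:

```python
import re

s = 'aaabbbbccc'

# Without the r prefix, '\1' is the character with code 1, not a backreference.
assert '\1' == '\x01'

# findall returns only what the capturing group caught: one letter per run.
letters = re.findall(r'(\w)\1+', s)

# finditer yields match objects, so group() gives each whole run.
runs = [m.group() for m in re.finditer(r'(\w)\1+', s)]

# Alternative: wrap the whole run in an outer group (the inner one becomes \2).
runs_via_findall = [whole for whole, _ in re.findall(r'((\w)\2+)', s)]

print(letters, runs, runs_via_findall)
```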

Python regex non-capturing issue?

I was trying to get all quoted (" or ') substrings from a string excluding the quotation marks.
I came up with this:
"((?:').*[^'](?:'))|((?:\").*[^\"](?:\"))"
For some reason the matching string still contains the quotation marks in it.
Any reason why ?
Sincerely, nikita.utiu.

You could do it with lookahead and lookbehind assertions:
>>> import re
>>> match = re.search(r"(?<=').*?(?=')", "a 'quoted' string. 'second' quote")
>>> print(match.group(0))
quoted
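One caveat with this approach, shown in a small sketch: lookarounds do not consume the quotes, so with re.findall the text between two quoted parts matches as well. A capture group that consumes the quotes is one possible alternative (only single quotes are handled here):

```python
import re

text = "a 'quoted' string. 'second' quote"

# The closing quote of one match can act as the "opening" quote of the next.
with_lookarounds = re.findall(r"(?<=').*?(?=')", text)
print(with_lookarounds)

# Consuming both quotes keeps the matches paired up.
with_group = re.findall(r"'(.*?)'", text)
print(with_group)
```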

Using non-capturing groups doesn’t mean that they are not captured at all. They just don’t create separate capturing groups like normal groups do.
But the structure of the regular expression requires that the quotation marks are part of the match:
"('[^']*'|\"[^\"]*\")"
Then just remove the surrounding quotation marks when processing the matched parts with matched_string[1:-1].
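A minimal sketch of that approach, using a made-up input string:

```python
import re

# Hypothetical input with one single-quoted and one double-quoted part.
text = "a 'quoted' string and a \"second\" one"

# Each match includes its surrounding quotation marks.
matches = re.findall(r"('[^']*'|\"[^\"]*\")", text)

# Slice off the first and last character to drop the quotes.
unquoted = [matched_string[1:-1] for matched_string in matches]
print(unquoted)
```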

You could try:
import shlex
...
lexer = shlex.shlex(your_input_string)
quoted = [piece.strip("'\"") for piece in lexer if piece.startswith("'") or piece.startswith('"')]
shlex (lexical analysis) takes care of escaped quotes for you. Note, though, that in older Python 2 releases it does not work with unicode strings.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
