How to only capture first group in regex? [duplicate] - python

This question already has answers here:
Regular expression to match string starting with a specific word
(10 answers)
Closed 2 years ago.
This is the text I am referring to:
' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM '
Using Python and regex, I only want to capture: "High 4:55AM 1.3m Low 11:35AM 0.34" (which is the first part of the text, and ideally I'd like to capture it without the extra spaces).
I've tried this regex so far: .{44}
It manages to capture the group of text I want, which is the first 44 characters, but it also captures subsequent groups of 44 characters which I don't want.

If you really just want the first 44 characters, you don't need a regex: you can simply use the Python string-slice operator:
first_44_characters = s[:44]
However, a regex is much more powerful, and could account for the fact that the length of the section you're interested in might change. For example, if the time is 10AM instead of 4AM the length of that part might change (or might not, maybe that's what the space padding is for?). In that case, you can capture it with a regex like this:
>>> re.match(r'\s+(High.*?)m', s).group(1)
'High 4:55AM 1.3'
\s matches any whitespace character, + matches one or more of the preceding element, the parentheses define a group starting with High and containing a minimal-length sequence of any character, and the m after the parentheses says the group ends right before a lowercase m character.
If you want, you can also use the regex to extract the individual parts of the sequence:
>>> re.match(r'\s+(High)\s+(\d+\:\d+)(AM|PM)\s+(\d+\.\d+)m', s).groups()
('High', '4:55', 'AM', '1.3')
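For reference, here is a minimal self-contained sketch of the two patterns above, with the sample string from the question defined and `import re` included. The third pattern (`High.*?Low.*?`) is not from the answer; it is an assumed variation that extends the capture through the first Low reading, which is the span the question asks for.

```python
import re

# Sample text from the question (spacing as shown there).
s = (' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m '
     'First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM ')

# Pattern from the answer: lazy match up to the first lowercase "m".
first_high = re.match(r'\s+(High.*?)m', s).group(1)

# Pattern from the answer: split the first reading into its parts.
parts = re.match(r'\s+(High)\s+(\d+:\d+)(AM|PM)\s+(\d+\.\d+)m', s).groups()

# Assumed variation (not in the answer): continue through the first "Low" reading.
high_and_low = re.match(r'\s*(High.*?Low.*?)m', s).group(1)

print(first_high)    # High 4:55AM 1.3
print(parts)         # ('High', '4:55', 'AM', '1.3')
print(high_and_low)  # High 4:55AM 1.3m Low 11:35AM 0.34
```

Because `re.match` returns `None` when nothing matches, real code would normally check the result before calling `.group()`.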

This regex captures everything from the first "High" up to (but not including) the next "High", or up to the end of the string if there is no next one. It also strips the extra spaces at the beginning and end of the captured group.
^\s*(High.*?)\s*(?=$|High)
If you want to reduce runs of multiple spaces inside the captured group to single spaces, you can afterwards replace the regex " +" with " ".
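A minimal sketch of this approach in Python, assuming the sample string from the question. The variable names are only illustrative.

```python
import re

# Sample text from the question (spacing as shown there).
s = (' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m '
     'First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM ')

# Capture from the first "High" up to the next "High" (or the end of the string).
match = re.search(r'^\s*(High.*?)\s*(?=$|High)', s)

# Collapse runs of spaces inside the captured group to a single space.
first_block = re.sub(r' +', ' ', match.group(1)) if match else None

print(first_block)  # High 4:55AM 1.3m Low 11:35AM 0.34m
```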

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", so that the maximum number of repeats for any given character is two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
# new_string == "aare yoou okk"
You could write
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
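The two approaches can be compared side by side. The sketch below also wraps the first one in a hypothetical helper, `limit_repeats`, with a `max_repeats` parameter; neither name appears in the answers, and the generalization is an assumption about what a reusable version might look like.

```python
import re

def limit_repeats(text, max_repeats=2):
    """Cut runs of the same word character down to at most max_repeats."""
    # (\w)\1{N,} matches a character followed by N or more copies of itself,
    # i.e. a run that is longer than N.
    pattern = re.compile(r'(\w)\1{%d,}' % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

string = "aaaaaaaaare yooooooooou okkkkkk"

# Backreference-and-replace approach.
by_replace = limit_repeats(string)

# Lookahead approach: delete everything in a run except its last two characters.
by_lookahead = re.sub(r'(\w)\1*(?=\1\1)', '', string)

print(by_replace)    # aare yoou okk
print(by_lookahead)  # aare yoou okk
```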

Splitting a string via Regex and Maxsplit returns multiple splits

I have a list of strings taken from chat log data, and I am trying to find the optimal method of splitting the speaker from the content of speech. Two examples are as follows:
mystr = ['bob123 (5:09:49 PM): hi how are you',
'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
Note that, while they are broadly similar, there are some stylistic differences I need to account for (inclusion of dates, period marks, extra spaces etc.). I require a way to standardize and split such strings, and others like these, into something like the following list:
mystrList = [['bob123','hi how are you'],['jane_r16','What day is it today']]
Given that I do not need the times, numbers, or most punctuation, I thought a reasonable first step would be to remove anything non-essential. After doing so, I now have the following:
myCleanstr = ['bob(): hi how are you','janer() : What day is it today?']
Doing this has given me a pretty unique sequence of characters per string, (): , that is unlikely to appear elsewhere in the same string. My subsequent thinking was to use this as a delimiter to split each string using regex:
mystr_split = [re.split(r'\(\)( ){,2}:', i, maxsplit=1, flags=re.I) for i in myCleanstr]
Here, my intention was the following:
\(\) Find a sequence of an open followed by a closed parentheses symbol
( ){,2} Then find zero, one, or two whitespaces
: Then find a colon symbol
However, in both instances, I receive three objects per string. I get the correct speaker ID, and speech content. But, in the first string I get an additional NoneType Object, and in the second string I get an additional string filled with a single white-space.
I had assumed that including maxsplit=1 would mean that the process would end after the first split has been found, but this doesn't appear to be the case. Rather than filter my results on the content I need I would like to understand why it is performing as it is.
You can use
^(\S+)\s*\([^()]*\)\s*:\s*(.+)
Or, if the name can have whitespaces:
^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)
The regexes match:
^ - start of string
(\S+) - Group 1: one or more non-whitespace chars
[^(]*? - (second pattern only) zero or more chars other than a ( char, as few as possible
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
\s*:\s* - a colon enclosed with zero or more whitespaces
(.+) - Group 2: any one or more chars other than line break chars, as many as possible (the whole rest of the line).
See the Python demo:
import re
result = []
mystr = ['bob123 (5:09:49 PM): hi how are you', 'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
for s in mystr:
    m = re.search(r'^(\S+)\s*\([^()]*\)\s*:\s*(.+)', s)
    if m:
        result.append(list(m.groups()))
print(result)
# => [['bob123', 'hi how are you'], ['jane_r16', 'What day is it today?']]
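As for why `re.split` returned three items: when the pattern contains a capturing group, `re.split` also puts the text captured by that group into the result list (or `None` if the group did not take part in the match). `maxsplit=1` limits the number of splits, not the number of returned items. A minimal sketch, assuming the cleaned strings from the question, that compares the original pattern with one that has no capturing group:

```python
import re

# Cleaned strings from the question.
my_clean = ['bob(): hi how are you', 'janer() : What day is it today?']

# Original idea: the capturing group ( ) leaks into the result list.
with_group = [re.split(r'\(\)( ){0,2}:', s, maxsplit=1) for s in my_clean]

# Same split without a capturing group, then strip the leftover spaces.
without_group = [
    [part.strip() for part in re.split(r'\(\) {0,2}:', s, maxsplit=1)]
    for s in my_clean
]

print(with_group)
# [['bob', None, ' hi how are you'], ['janer', ' ', ' What day is it today?']]
print(without_group)
# [['bob', 'hi how are you'], ['janer', 'What day is it today?']]
```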

regex - Extract complete word base on match string [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
Can someone please help me with this? I'm trying to extract the word from a given sentence that contains one of the units G, GM, KG, L, ML, PCS along with a number.
I am able to match the string, but I'm not sure how to extract the complete word.
For example, my input is "This packet contains 250G Dates" and the output should be 250G.
Another example is "You paid for 2KG Apples" and the output should be 2KG.
With my regular expression I'm getting only the matched unit, not the complete word :(
import re
val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G','GM','KG','L','ML','PCS']
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
This regex will not get you what you want:
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
Let's break it down:
\d+: one or more digits
\.?: a dot (optional, as indicated by the question mark)
\d*: zero or more digits
(\s|G|KG|GM|L|ML|PCS): a group of alternatives, but whitespace is one of the alternatives when it should be outside the group. What you probably want is to allow optional whitespace between the number and the unit, i.e. 240G or 240 G
\s?: optional whitespace
A better expression for your purpose could be:
re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
That means: one or more digits, followed by optional whitespace and then any one of these units: G|KG|GM|L|ML|PCS.
Note the presence of ?: to indicate a non-capturing group. Without it, findall would return only the captured unit, G.
Try using this Regex:
\d+\s*(G|KG|GM|L|ML|PCS)\s?
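It matches every string that starts with at least one digit and is then followed by one of the units. Whitespace is allowed between the digits and the unit, and after the unit. Note that because the units are in a capturing group, re.findall returns only the unit here; use re.search(...).group(0) to get the whole match.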
Adjust this like you want to :)
Use non-capturing parentheses (?:...) instead of the normal ones. When the pattern has no capturing groups, findall returns the string(s) that match the whole pattern.
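The difference between the capturing and non-capturing versions can be seen in a short sketch. The final pattern is an assumed refinement, not taken from the answers: it puts longer units first so that "GM" is not cut short by "G", adds a word boundary after the unit, and allows an optional decimal part as in the question's original attempt.

```python
import re

val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G', 'GM', 'KG', 'L', 'ML', 'PCS']

# With a capturing group, findall returns only the group's contents.
only_unit = re.findall(r'\d+\s*(G|KG|GM|L|ML|PCS)', val)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r'\d+\s*(?:G|KG|GM|L|ML|PCS)', val)

# Assumed refinement: longest units first, word boundary, optional decimals.
alternatives = '|'.join(sorted(key_vals, key=len, reverse=True))
pattern = re.compile(r'\d+(?:\.\d+)?\s*(?:%s)\b' % alternatives)

print(only_unit)                                           # ['G']
print(whole)                                               # ['240G']
print(pattern.findall('You paid for 2KG Apples'))          # ['2KG']
print(pattern.findall('1.5 L bottle and 240GM jar'))       # ['1.5 L', '240GM']
```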

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or,
but I just want it to output a single group (to abstract the code
from the regex).
I could also use a 2nd step of regex to capture the timestamp from
the group that has the quotation marks, but again that would break
the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message; I want it to capture the timestamp only as the value of a key or at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (I didn't figure out) with the same solution, probably!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
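A minimal sketch of the lookaround pattern applied to both document kinds from the question. The variable names `log_line`, `json_line`, and `found` are only illustrative.

```python
import re

# Whole-match pattern: no quotes or trailing space end up in group(0).
rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

log_line = '1545994641 INFO: ...'
json_line = '{"deliveryDate":"1545994641","error"..."}'

found = []
for msg in (log_line, json_line):
    m = rx.search(msg)
    found.append(m.group(0) if m else None)

print(found)  # ['1545994641', '1545994641']
```

Since the calling code only reads `group(0)`, the same code path works for both document kinds without knowing which alternative matched.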
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

python regex tokenization custom characters to be splitting points

I want to tokenize a sentence (purely with regex without having to install NLTK or similar). I want the tokenizer to:
split around hyphens '-' and apostrophes "'" --> e.g. ( I haven't heard good-news:: I haven ' t heard good - news )
split around all other characters if only followed by a space --> e.g. ( I have, some 16,000.13 dollars (A) to spare :: I have , some 16,000.13 dollars ( A ) to spare)
I created this function, but it still does not split around "(":
def tok(txt):
    # We want to make sure that only apostrophes and hyphens are splitting points, and all other non-alpha characters not followed by a space are non-splitting points
    sub = re.sub(r'(?u)(\W)(?!\S)', r' \1 ', txt)
    sub = re.sub(r"(?u)([\-\'\[\(\{])", r' \1 ', sub)  # will need to add more exceptions
    return [v for v in re.split(r'(?u)\s+', sub) if v]
A Python Oddity: re.split and zero-width matches
Most regex engines allow you to split on a zero-width match, i.e., a certain position in the string. For instance, you can use the lookbehind (?<=&) to split if the previous character is a &. However, Python's re module did not split on zero-width matches before Python 3.7; on those versions you need the regex module with the V1 flag turned on. To see this, try:
re.split("(?=&)", "a&fine&tree")
and
regex.split("(?V1)(?=&)", "a&fine&tree")
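On Python 3.7 and later, the standard `re.split` does split on empty matches, so the first call above can be checked directly. A minimal sketch using only the standard library (the third-party `regex` module is left out so the block runs without extra installs):

```python
import re

# Lookahead: split at each position that is followed by "&".
before_amp = re.split(r"(?=&)", "a&fine&tree")

# Lookbehind: split at each position that is preceded by "&".
after_amp = re.split(r"(?<=&)", "a&fine&tree")

print(before_amp)  # on Python 3.7+: ['a', '&fine', '&tree']
print(after_amp)   # on Python 3.7+: ['a&', 'fine&', 'tree']
```

On older interpreters the same calls return the string unsplit, which is the behavior this section describes.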
Splitting vs. Match All
So if we want to stick to re, splitting may be awkward. Luckily, splitting a string and matching all the tokens of interest are two sides of the same coin. In this case, matching is faster and gives you the same array.
Please note that I have adjusted your rules based on your desired output, but we can tweak them back. For instance, it does sound like you want to split unconditionally around ( since there is no space after ( in (A. Also, digits sound like they should be treated like letters.
Just use this:
result = re.findall("[-'()]|[^a-z0-9 ](?= )|(?:[a-z0-9]|[^-'()a-z0-9 ](?! ))+", subject, re.IGNORECASE)
The tokens:
I
haven
'
t
heard
good
-
news
I
have
,
some
16,000.13
dollars
(
A
)
to
spare
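The token list above can be reproduced with a small wrapper. This is a sketch that reuses the `findall` pattern from this answer unchanged; the function name `tok` is borrowed from the question, and `TOKEN_RE` is an illustrative name.

```python
import re

TOKEN_RE = re.compile(
    r"[-'()]"                                # always a token on its own
    r"|[^a-z0-9 ](?= )"                      # other symbol directly before a space
    r"|(?:[a-z0-9]|[^-'()a-z0-9 ](?! ))+",   # letters/digits, plus symbols not before a space
    re.IGNORECASE,
)

def tok(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)

print(tok("I haven't heard good-news"))
# ['I', 'haven', "'", 't', 'heard', 'good', '-', 'news']
print(tok("I have, some 16,000.13 dollars (A) to spare"))
# ['I', 'have', ',', 'some', '16,000.13', 'dollars', '(', 'A', ')', 'to', 'spare']
```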
References
The Elements of Good Regex Style (search for "split")
re does not split on zero-width matches
