Regex to avoid matching the "\n" character in Python

I want to apply a regex to the string below in Python, where I only want to capture Model Number : 123. I tried the regex below, but it didn't give me the result I wanted.
string = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'(?s)Model Number:.*?\n',string)
The output is Model Number : 123\n. How can I avoid the \n at the end of the output?

Remove the DOTALL (?s) inline modifier to avoid matching a newline char with ., add \s* after Number and use .* instead of .*?\n:
r'Model Number\s*:.*'
See the regex demo
Here, Model Number will match a literal substring, \s* will match 0+ whitespaces, : will match a colon and .* will match 0 or more chars other than line break chars.
Python demo:
import re
s = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'Model Number\s*:.*',s)
print(model_number) # => ['Model Number : 123']
If you need to extract just the number use
r'Model Number\s*:\s*(\d+)'
See another regex demo and this Python demo.
Here, (\d+) will capture 1 or more digits and re.findall will only return these digits. Or, use it with re.search and once the match data object is obtained, grab it with match.group(1).
NOTE: If the string appears at the start of the string, use re.match. Or add ^ at the start of the pattern and use re.M flag (or add (?m) at the start of the pattern).
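As a minimal sketch of the two variants mentioned above (re.search with match.group(1), and a ^ anchor combined with re.M), the following uses the sample string from the question. The variable names model and serial are only illustrative.

```python
import re

s = """Model Number : 123
Serial Number : 456"""

# re.search returns a match object (or None); group(1) holds the captured digits.
match = re.search(r'Model Number\s*:\s*(\d+)', s)
model = match.group(1) if match else None
print(model)

# With re.M, ^ anchors at the start of every line, not only at the start of the string.
serial = re.findall(r'^Serial Number\s*:\s*(\d+)', s, re.M)
print(serial)
```

Checking for None before calling group(1) avoids an AttributeError when the label is missing from the input.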

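You can use the strip() method. Since re.findall returns a list, apply it to each match:
[m.strip() for m in model_number]
This removes leading and trailing whitespace, including the trailing newline.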

Related

How to create regex to match a string that contains only hexadecimal numbers and arrows?

I am using a string that uses the following characters:
0-9
a-f
A-F
-
>
The mixture of the greater than and hyphen must be:
->
-->
Here is the regex that I have so far:
[0-9a-fA-F\-\>]+
I tried these others using exclusion with ^ but they didn't work:
[^g-zG-Z][0-9a-fA-F\-\>]+
^g-zG-Z[0-9a-fA-F\-\>]+
[0-9a-fA-F\-\>]^g-zG-Z+
[0-9a-fA-F\-\>]+^g-zG-Z
[0-9a-fA-F\-\>]+[^g-zG-Z]
Here are some samples:
"0912adbd->12d1829-->218990d"
"ab2c8d-->82a921->193acd7"
Firstly, you don't need to escape - and >
Here's the regex that worked for me:
^([0-9a-fA-F]*(->)*(-->)*)*$
Here's an alternative regex:
^([0-9a-fA-F]*(-+>)*)*$
What does the regex do?
^ matches the beginning of the string and $ matches the ending.
* matches 0 or more instances of the preceding token
Created a big () capturing group to match any token.
[0-9a-fA-F] matches any character that is in the range.
(->) and (-->) match only those given instances.
Putting it into a code:
import re
regex = "^([0-9a-fA-F]*(->)*(-->)*)*$"
re.match(re.compile(regex),"0912adbd->12d1829-->218990d")
re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")
re.match(re.compile(regex),"this-failed->so-->bad")
You can also convert it into a boolean:
print(bool(re.match(re.compile(regex),"0912adbd->12d1829-->218990d")))
print(bool(re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")))
print(bool(re.match(re.compile(regex),"this-failed->so-->bad")))
Output:
True
True
False
I recommend using regexr.com to check your regex.
If at least one arrow must be present, and not at the start or end of the string, you can use this pattern with the case-insensitive flag:
^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$
Explanation
^ Start of string
[a-f\d]+ Match 1+ chars a-f or digits
(?: Non capture group to repeat as a whole
-{1,2}>[a-f\d]+ Match - or -- and > followed by 1+ chars a-f or digits
)+ Close the non capture group and repeat 1+ times
$ End of string
See a regex demo and a Python demo.
import re
pattern = r"^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$"
s = ("0912adbd->12d1829-->218990d\n"
"ab2c8d-->82a921->193acd7\n"
"test")
print(re.findall(pattern, s, re.I | re.M))
Output
[
'0912adbd->12d1829-->218990d',
'ab2c8d-->82a921->193acd7'
]
You can construct the regex in steps. If I understand your requirements, you want a sequence of hexadecimal numbers (like a01d or 11efeb23), separated by arrows with one or two hyphens (-> or -->).
The hex part's regex is [0-9a-fA-F]+ (assuming it cannot be empty).
The arrow's regex can be -{1,2}> or (->|-->).
The arrow is only needed before each hex number but the first, so you'll build the final regex in two parts: the first number, then the repetition of arrow and number.
So the general structure will be:
NUMBER(ARROW NUMBER)*
Which gives the following regex:
[0-9a-fA-F]+(-{1,2}>[0-9a-fA-F]+)*
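The step-by-step construction above could be sketched in Python roughly as follows. The names HEX, ARROW and is_hex_chain are hypothetical, a non-capturing group is used for the repetition, and re.fullmatch is an assumption about how you want to anchor the pattern (it requires the whole string to match).

```python
import re

HEX = r"[0-9a-fA-F]+"   # one hexadecimal number, at least one digit
ARROW = r"-{1,2}>"      # -> or -->

# NUMBER(ARROW NUMBER)*
pattern = re.compile(f"{HEX}(?:{ARROW}{HEX})*")

def is_hex_chain(text):
    """Return True if text is hex numbers separated by -> or --> arrows."""
    return pattern.fullmatch(text) is not None

results = [is_hex_chain(t) for t in (
    "0912adbd->12d1829-->218990d",
    "ab2c8d-->82a921->193acd7",
    "this-failed->so-->bad",
)]
print(results)
```

Note that this version also accepts a single hex number with no arrow at all; change the trailing * to + if at least one arrow is required.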

Test for comma delimited string, ignoring any encountered periods, say from real numbers?

The following works for a simple comma-delimited string that has no periods, but it breaks if the string contains periods, for example in real numbers.
pattern = re.compile(r"^(\w+)(,\s*\w+)*$")
How can I modify the pattern above to allow periods, while still validating that the given string is comma delimited?
A sample test string is "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0".
\w matches "word" characters: letters, digits and _. It doesn't match a dot. If you want to match dots as well, use [\w.] instead of \w:
pattern = re.compile(r"^([\w.]+)(,\s*[\w.]+)*$")
You might also want to add -, if you expect negative numbers. To put - in a character class, you either have to backslash escape it or make sure it's either the first or last character in the class:
[-.\w]
[\w.-]
[\w\-.]
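A quick way to check the extended character class is to run it against a few inputs. This is only a sketch: the sample strings other than the one from the question are made up for illustration.

```python
import re

# Same structure as the pattern above, with - added at the end of the class.
pattern = re.compile(r"^([\w.-]+)(,\s*[\w.-]+)*$")

samples = [
    "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0",  # sample from the question
    "23,HIGH,-1.5,LOW",                      # negative number
    "23,,HIGH",                              # empty field
    "23;HIGH;1.0",                           # wrong delimiter
]
checks = {text: bool(pattern.match(text)) for text in samples}
for text, ok in checks.items():
    print(ok, text)
```

Keep in mind that a class like [\w.-] also accepts values such as "..." or "-", so it validates the delimiter structure rather than the fields themselves.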
If a value containing a dot can only be a number, and matching values made up of dots alone is not desired, you can use an alternation to match either word characters or a number.
^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$
Explanation
^ Start of string
(?: Non capture group
[+-]?\d*\.?\d+ Match an optional + or -, then optional digits, optional dot and 1+ digits
| Or
\w+ Match 1+ word characters
) Close non capture group
(?: Non capture group
, Match the comma
(?:[+-]?\d*\.?\d+|\w+) The same pattern as in the first part
)* Close non capture group and optionally repeat to also match a single occurrence
$ End of string
Regex demo
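To see how the alternation behaves, here is a small sketch that applies it to a handful of strings. Apart from the sample from the question, the inputs are invented for illustration. Unlike the earlier pattern, this one does not allow whitespace after a comma.

```python
import re

pattern = re.compile(r"^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$")

samples = [
    "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0",  # sample from the question
    "-2.5,+.75",                             # signed numbers, leading dot
    "1.,LOW",                                # dot without digits after it
    "..,LOW",                                # dots only
    "23, HIGH",                              # space after the comma
]
verdicts = {text: bool(pattern.match(text)) for text in samples}
for text, ok in verdicts.items():
    print(ok, text)
```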

Regex to extract first 5 digit+character from last hyphen

I am trying to extract everything up to and including the first 5 characters/digits after the last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take the element after the last hyphen: (\w+)[^-.]*$
But how do I take only the first 5 characters and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
Regex demo
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, and you want the first 5 letters or digits (excluding _) after the last hyphen, you can match word characters without an underscore using a negated character class, [^\W_]{5}.
The {5} quantifier repeats that class 5 times, and the lookahead asserts that no more hyphens occur to the right.
^.*-[^\W_]{5}(?=[^-]*$)
Regex demo
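A minimal sketch of this pattern in Python, applied one string at a time with re.search. The helper name prefix_through_last_segment and the last two sample strings are hypothetical, added only to show the "other characters" and "underscore" cases.

```python
import re

pattern = re.compile(r"^.*-[^\W_]{5}(?=[^-]*$)")

def prefix_through_last_segment(text):
    """Return everything up to 5 alphanumerics after the last hyphen, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

out = [prefix_through_last_segment(t) for t in (
    "X008-TGa19-ER751QF7",
    "X002-KF13-ER782cPU80",
    "X#08-TG a19-ER751QF7",   # other characters before the last hyphen
    "A-B_123456",             # underscore within the first 5 chars: no match
)]
print(out)
```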
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen character
) Capture group 1 end
[^-]* Any number of non-hyphen character
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This makes use of the greedy behaviour of the first .*, which will try to match as much as possible, so the - will be the last one with at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/
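The two capture-group patterns above can be compared side by side with a short sketch. The names strict, greedy and group1, and the third input with a short last segment, are assumptions added for illustration.

```python
import re

strict = re.compile(r"^(.*-[^-]{5})[^-]*$")   # 5 non-hyphen chars after the last hyphen
greedy = re.compile(r"^(.*-.{5}).*$")         # relies on the greedy first .*

def group1(pattern, text):
    """Return capture group 1, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else None

inputs = ["X008-TGa19-ER751QF7", "X002-KF13-ER782cPU80", "X008-TGa19-ER7"]
compared = {t: (group1(strict, t), group1(greedy, t)) for t in inputs}
for text, pair in compared.items():
    print(text, pair)
```

For the two strings from the question both patterns give the same result. They differ when fewer than 5 characters follow the last hyphen: the first pattern does not match at all, while the second falls back to an earlier hyphen.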

Regex Search end of line and beginning of next line

I am trying to come up with a regex that searches for a keyword match at the end of a line and captures the beginning of the next line (if present).
I have tried the regex below, but it does not return the desired result:
re.compile(fr"\s(?!^)(keyword1|keyword2|keyword3)\s*\$\n\r\((\w+\W+|W+\w+))", re.MULTILINE | re.IGNORECASE)
My input for example is
sentence = """ This is my keyword
/n value"""
Output in above case should be keyword value
Thanks in advance
You could match the keyword (or use an alternation to match more keywords) and take trailing tabs and spaces into account, both after the keyword and after the newline.
Using 2 capturing groups as in the pattern you tried:
(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert what is directly on the left is not a non whitespace char
(keyword) Capture in group 1 matching the keyword
[\t ]* Match 0+ tabs or spaces
\r?\n Match newline
[\t ]* Match 0+ tabs or spaces
(\w+) Capture group 2 match 1+ word chars
(?!\S) Negative lookahead, assert what is directly on the right is not a non whitespace char
Regex demo | Python demo
For example:
import re
regex = r"(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)"
test_str = (" This is my keyword\n"
" value")
matches = re.search(regex, test_str)
if matches:
    print('{} {}'.format(matches.group(1), matches.group(2)))
Output
keyword value
How about \b(keyword)\n(\w+)\b?
\b(keyword)\n(\w+)\b
\b get a word boundary
(keyword) capture keyword (replace with whatever you want)
\n match a newline
(\w+) capture some word characters, one or more
\b get a word boundary
Because keyword and \w+ are in capture groups, you can reference them as you wish later in your code.
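As a rough sketch of referencing the two capture groups from Python, assuming the keyword is immediately followed by a newline and the next line starts with a word character (the text values here are illustrative):

```python
import re

pattern = re.compile(r"\b(keyword)\n(\w+)\b")

def join_groups(text):
    """Return 'keyword value' built from groups 1 and 2, or None if no match."""
    m = pattern.search(text)
    return f"{m.group(1)} {m.group(2)}" if m else None

direct = join_groups("This is my keyword\nvalue")
indented = join_groups("This is my keyword\n value")
print(direct)
print(indented)
```

The second call returns None because this pattern does not allow whitespace between the newline and the next word.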
My guess is that, depending on the number of newlines that you might have, an expression similar to:
\b(keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
might be somewhat close, with the value in \2. You can make the first group non-capturing:
\b(?:keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
Then \1 is the value.
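Here is a small sketch of the non-capturing variant in Python, written with the carriage return escaped as \r inside the character class. The helper name value_after_keyword and the sample strings are made up for illustration, and re.IGNORECASE is carried over from the flags used in the question.

```python
import re

pattern = re.compile(
    r"\b(?:keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)",
    re.IGNORECASE,
)

def value_after_keyword(text):
    """Return the first non-whitespace run on the line after a keyword, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

values = [value_after_keyword(t) for t in (
    "first Keyword2\r\nalpha",   # Windows line ending
    "second keyword3\nbeta",     # Unix line ending
    "keyword1 gamma",            # no line break after the keyword
)]
print(values)
```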
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

Matching newline and any character with Python regex

I have a text like
var12.1
a
a
dsa
88
123!!!
secondVar12.1
The string between var and secondVar may be different (and there may be different count of them).
How can I dump it with regexp?
I'm trying something like this, to no avail:
re.findall(r"^var[0-9]+\.[0-9]+[\n.]+^secondVar[0-9]+\.[0-9]+", str, re.MULTILINE)
You can grab it with:
var\d+(?:(?!var\d).)*?secondVar
See demo. The re.S (or re.DOTALL) modifier must be used with this regex so that . can match a newline. As written, the pattern has no capturing group, so re.findall returns each whole match, delimiters included; wrap the part between the delimiters in parentheses if you only want that text (it will then be in Group 1).
NOTE: The closest match will be matched due to the (?:(?!var\d).)*? tempered greedy token (i.e. if you have another var + a digit after var + 1+ digits, then the match will be between the second var and secondVar).
NOTE2: You might want to use \b word boundaries to match the words beginning with them: \bvar(?:(?!var\d).)*?\bsecondVar.
REGEX EXPLANATION
var - match the starting delimiter
\d+ - 1+ digits
(?:(?!var\d).)*? - a tempered greedy token that matches any char, 0 or more (but as few as possible) repetitions, that does not start a char sequence var and a digit
secondVar - match secondVar literally.
IDEONE DEMO
import re
p = re.compile(r'var\d+(?:(?!var\d).)*?secondVar', re.DOTALL)
test_str = "var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1\nvar12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1"
print(p.findall(test_str))
Result for the input string (I doubled it for demo purposes):
['var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar', 'var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar']
You're looking for the re.DOTALL flag, with a regex like this: var(.*?)secondVar. This regex would capture everything between var and secondVar.
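A minimal sketch of that suggestion, using the text from the question as a single string. The variable names between and without_flag are illustrative.

```python
import re

text = "var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1"

# With re.DOTALL, . also matches newlines, so the lazy group can span lines.
between = re.findall(r"var(.*?)secondVar", text, re.DOTALL)

# Without the flag, . stops at the first newline and nothing is found.
without_flag = re.findall(r"var(.*?)secondVar", text)

print(between)
print(without_flag)
```

Because the pattern has one capturing group, re.findall returns only the captured text, not the var and secondVar delimiters.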
