Regular expression for (=string) - python

I have a text file with thousands of lines. Here's an example:
line = .Falies/367. 11DG1550/11DG15537.Axiom=nt60
line = .Failies/367. 11DG1550/11DG15537.Axiom=nt50
I tried to extract the string at the end 'nt60', 'nt50'.
lines = line.split('=')
version = lines[-1]
The problem is that the end-of-line character ('\n') is included.
I thought of using a regular expression search to match the string starting from '=nt',
but I have no idea what to use to match a =, a word, and a number.
Can anyone help?

Your first approach is absolutely fine. You can just use the string that you have extracted using your first method and then apply strip() to it:
strip() removes all leading and trailing whitespace, including newlines, from a string.
>>> your_str = 'nt60\n'
>>> your_str.strip()
'nt60'
For your case:
lines = line.rsplit('=',1)
version = lines[-1].strip()
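As a minimal sketch of the same idea applied to every line of a file: the loop below uses an in-memory io.StringIO object as a stand-in for the real file (the names sample and versions are only illustrative), so each line ends in '\n' just as it would when iterating over an open file.

```python
import io

# Stand-in for the real text file; iterating over it yields lines ending in '\n'.
sample = io.StringIO(
    ".Falies/367. 11DG1550/11DG15537.Axiom=nt60\n"
    ".Failies/367. 11DG1550/11DG15537.Axiom=nt50\n"
)

versions = []
for line in sample:
    # Split once from the right, then drop the trailing newline.
    versions.append(line.rsplit('=', 1)[-1].strip())

print(versions)
```

With a real file, the for loop would run over the object returned by open() instead of sample.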

The regex to match a = nt then a number is:
=(nt\d+)
And in your example:
line = .Falies/367. 11DG1550/11DG15537.Axiom=nt60
line = .Failies/367. 11DG1550/11DG15537.Axiom=nt50
it will return two matches:
MATCH 1
1. [49-53] `nt60`
MATCH 2
1. [105-109] `nt50`
Explanation:
`=` matches the character `=` literally
1st Capturing group `(nt\d+)`
`nt` matches the characters `nt` literally (case sensitive)
`\d` match a digit `[0-9]`
`+` Quantifier: Between one and unlimited times, as many times as possible,
giving back as needed
If you want your regex to match a =, a word, and then a number, just replace the nt with \w+ to match any word.
Hope this helps.
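A short sketch of how both patterns might be used from Python, assuming the lines have already been read into a list (the variable names here are illustrative, not from the question). re.search scans the whole line, and group(1) returns only the captured part, so the '=' and the trailing newline are left out.

```python
import re

lines = [
    ".Falies/367. 11DG1550/11DG15537.Axiom=nt60\n",
    ".Failies/367. 11DG1550/11DG15537.Axiom=nt50\n",
]

specific = re.compile(r'=(nt\d+)')    # '=' then 'nt' then one or more digits
general = re.compile(r'=(\w+\d+)')    # '=' then any word characters ending in digits

versions = []
for line in lines:
    m = specific.search(line)
    if m:                              # search() returns None when nothing matches
        versions.append(m.group(1))

general_versions = [general.search(line).group(1) for line in lines]

print(versions)
print(general_versions)
```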

Related

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The space between the two rows is important and plan to split later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to incorporate multiline but unsure how to read file in differently or incorporate. Have been trying many adjustments to regex but have not been successful.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then 1+ times not a digit is matched, which will match until before 5.7. Then 1+ times any character except a newline will be matched which will match 5.7,-9.0,6.2 It will not match the following empty line and the next line.
One option could be to match your string and then capture, in a capturing group, all the following lines that start with a decimal (blank lines between them are allowed).
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
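Since the original loop went through the file line by line, the pattern never saw more than one line at a time. A possible sketch of the whole flow, assuming the file content is read in one go (a literal string stands in for open("file.txt").read() here, and the names content, block and rows are illustrative):

```python
import re

# In a real script this would be: content = open("file.txt").read()
content = "start after here: test\n\n\n5.7,-9.0,6.2\n\n1.6,3.79,3.3\n"

regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"

m = re.search(regex, content)
block = m.group(1) if m else ""

# Split the captured block on runs of newlines to get one entry per row.
rows = [row for row in re.split(r"[\r\n]+", block) if row]
print(rows)
```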
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
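A minimal sketch of that last step, assuming the captured block from the previous example is already available as a string (block, rows and values are illustrative names). Because the pattern contains only non-capturing groups and a lookahead, findall returns the full matched numbers.

```python
import re

number = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

# Split on one or more newlines, then pull the numbers out of each row.
rows = [number.findall(row) for row in re.split(r"\n+", block)]

# Converting to float is an extra step, not something the question asked for.
values = [[float(x) for x in row] for row in rows]
print(rows)
print(values)
```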

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex,
you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name).
It tries to match just the whole string.
Or put $ anchor at the end of your regex and call re.match(pattern, string).
It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of regex and call re.search(pattern,
string), but it would be a very strange combination.
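To make the comparison concrete, here is a small sketch of the three variants side by side, using the pattern that appears later in this answer. The sample list and variable names are illustrative. One difference worth knowing: $ also matches just before a trailing newline, while fullmatch does not tolerate one.

```python
import re

pattern = r'(?:\$\s?)?[\d,.]+'
samples = ["416", "15,704", "$40 million"]

# 1) fullmatch: the whole string must match.
full = [s for s in samples if re.fullmatch(pattern, s)]

# 2) match (anchored at the start) plus a trailing $.
anchored_match = [s for s in samples if re.match(pattern + r'$', s)]

# 3) search with both ^ and $ anchors.
anchored_search = [s for s in samples if re.search(r'^' + pattern + r'$', s)]

# $ accepts a trailing newline, fullmatch does not.
dollar_with_newline = re.match(r'\d+$', "416\n") is not None
fullmatch_with_newline = re.fullmatch(r'\d+', "416\n") is not None

print(full, anchored_match, anchored_search)
print(dollar_with_newline, fullmatch_with_newline)
```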
I also have a remark about how you specified your conditions, which may be
incomplete: you gave the string $40 million as an example and stated that the
only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
So I think the most natural solution is to call a function devoted just to
match the whole string (i.e. fullmatch), without adding start / end
anchors and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And an optional space.
)? - End of the non-capturing group and ? stating that the whole
group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [
and ] the dot represents itself, so no backslash quotation is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive
dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
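A quick sketch comparing the loose and the stricter variant, assuming the optional $ prefix is kept in both. The separator inside the repeated group is written as mandatory, as the details above describe, and \d is used in place of the equivalent [\d]. The sample strings and names are illustrative.

```python
import re

loose = re.compile(r'(?:\$\s?)?[\d,.]+')
strict = re.compile(r'(?:\$\s?)?\d+(?:[,.]\d+)*')

samples = ["416", "15,704", "$40", "2...5", "3.,44", "$40 million"]

# fullmatch requires the entire string to fit the pattern.
loose_ok = [s for s in samples if loose.fullmatch(s)]
strict_ok = [s for s in samples if strict.fullmatch(s)]

print(loose_ok)
print(strict_ok)
```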
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$')  # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches a line made up only of digits, commas and dots, followed by zero or more whitespace characters. If you want to allow at most one , or . use the following pattern: '^\d+[,.]?\d+\s*$', code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
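The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits, with optional trailing whitespace (\s*). Note that \d+[,.]?\d+ needs at least two digits, so a single-digit line such as 5 is not matched.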

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The pattern (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which covers your requirement: it splits on a ',' that has a non-numeric character on at least one side and leaves commas between two digits (as in 2,4,5-TP) alone.
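The behaviour described above can be checked with a short sketch. The q(?=u) example shows that the lookahead does not consume the u, and the split is then run on both entries from the question (variable names are illustrative).

```python
import re

# Lookahead: the 'u' must follow, but it is not part of the match.
m = re.search(r'q(?=u)', 'quit')
matched = m.group(0)

pattern = r'(?<=\D),\s*|\s*,(?=\D)'
entries = [
    '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP',
    'Lead,Paints/Pigments,Zinc',
]

# Commas between two digits satisfy neither alternative, so they are kept.
parts = [re.split(pattern, entry) for entry in entries]

print(matched)
print(parts)
```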
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use a regex with lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

How to ignore multiple whitespace chars and words in python regex

I have a pattern which is looking for word1 followed by word2 followed by word3 with any number of characters in between.
My file, however, contains many random newline and other whitespace characters, which means that between word1 and word2, or between word2 and word3, there could be 0 or more words and/or 0 or more newlines.
Why isn't this code working? (It's not matching anything.)
strings = re.findall('word1[.\s]*word2[.\s]*word3', f.read())
[.\s]* - What I mean by this - find either '.'(any char) or '\s'(newline char) multiple times(*)
Your regex does not work for two reasons. Inside a character class, . is a literal dot, so [.\s]* only matches dots and whitespace and cannot get past the words in between. And outside a character class, . matches any character except a newline by default, so a plain .* stops at the first newline character (\n).
To make . match newline characters as well, pass re.DOTALL as the third argument to findall:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
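A small sketch showing the effect of the flag, with a made-up text that has newlines and extra words between the three keywords (the text and variable names are illustrative):

```python
import re

text = "word1 some filler\n\nmore filler word2\n   \nword3 trailing"

# Without DOTALL, '.' stops at the first newline, so nothing is found.
without_flag = re.findall(r'word1.*?word2.*?word3', text)

# With DOTALL, '.' also matches '\n', so the match can span lines.
with_flag = re.findall(r'word1.*?word2.*?word3', text, re.DOTALL)

print(without_flag)
print(with_flag)
```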
You have two problems:
1) . doesn't mean anything special inside brackets [].
Change your [] to use () instead, like this: (.|\s)
2) \ doesn't mean what you think it does inside regular strings.
Try using raw strings:
re.findall(r'word1 ..blah..')
Notice the r prefix of the string.
Putting them together:
strings = re.findall(r'word1(.|\s)*word2(.|\s)*word3', f.read())
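However, do note that this changes the returned list: because the pattern now contains capturing groups, findall returns a list of tuples with the groups' contents instead of the full matches. Use non-capturing groups, (?:.|\s), if you want the full matched text.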
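To illustrate the change in the returned list, here is a sketch comparing capturing and non-capturing groups on a made-up text. The non-capturing form (?:.|\s) is a suggested variation, not part of the pattern above.

```python
import re

text = "word1 foo\nword2 bar\nword3"

# Capturing groups: findall returns one tuple per match, holding the
# groups' contents rather than the full matched text.
captured = re.findall(r'word1(.|\s)*word2(.|\s)*word3', text)

# Non-capturing groups: findall returns the full matched text.
whole = re.findall(r'word1(?:.|\s)*word2(?:.|\s)*word3', text)

print(captured)
print(whole)
```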

Regex find non digit and/or end of string

How do I include an end-of-string and a non-digit character in a Python 2.6 regular expression set for searching?
I want to find 10-digit numbers with a non-digit at the beginning and a non-digit or end-of-string at the end. It is a 10-digit ISBN number and 'X' is valid for the final digit.
The following do not work:
is10 = re.compile(r'\D(\d{9}[\d|X|x])[$|\D]')
is10 = re.compile(r'\D(\d{9}[\d|X|x])[\$|\D]')
is10 = re.compile(r'\D(\d{9}[\d|X|x])[\Z|\D]')
The problem arises with the last set: [\$|\D] to match a non-digit or end-of-string.
Test with:
line = "abcd0123456789"
m = is10.search(line)
print m.group(1)
line = "abcd0123456789efg"
m = is10.search(line)
print m.group(1)
You have to group the alternatives with parentheses, not brackets:
r'\D(\d{9}[\dXx])($|\D)'
| is a different construct than []. It marks an alternative between two patterns, while [] matches one of the contained characters. So | should only be used inside of [] if you want to match the actual character |. Grouping of parts of patterns is done with parentheses, so these should be used to restrict the scope of the alternative marked by |.
If you want to avoid creating an extra match group, you can use (?: ) instead:
r'\D(\d{9}[\dXx])(?:$|\D)'
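A sketch of the corrected pattern run against the two test lines from the question, plus two extra lines added here for illustration (one ending in X, one with eleven digits). search() returns None when there is no match, so the sketch checks for that before calling group(1).

```python
import re

is10 = re.compile(r'\D(\d{9}[\dXx])(?:$|\D)')

tests = [
    "abcd0123456789",      # from the question: ISBN at end of string
    "abcd0123456789efg",   # from the question: ISBN followed by letters
    "abcd012345678X",      # extra case: X as the final character
    "abcd01234567890",     # extra case: eleven digits, should not match
]

results = []
for line in tests:
    m = is10.search(line)
    results.append(m.group(1) if m else None)

print(results)
```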
\D(\d{10})(?:\Z|\D)
This finds a non-digit followed by 10 digits and then a single non-digit or the end of the string; only the digits are captured. I see that you are searching for nine digits followed by a digit, X, or x, but I don't see the same thing in your requirements.
