I am trying to extract first 5 character+digit from last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take element from the last -- (\w+)[^-.]*$
But now how to take first 5, then return my the entire value as the output as shown in the example.
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Math 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
Regex demo
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other character in the string, to get the first 5 digits or characters except _ after the last hyphen, you can match word characters without an underscore using a negated character class [^\W_]{5}
Repeat that 5 times while asserting no more underscore at the right.
^.*-[^\W_]{5}(?=[^-]*$)
Regex demo
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen character
) Capture group 1 end
[^-]* Any number of non-hyphen character
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This is making use of behaviour greedy match of first .*, which will try to match as much as possible, so the - will be the last one with at least 5 character following it.
https://regex101.com/r/CFqgeF/1/
I have the strings 'amount $165' and 'amount on 04/20' (and a few other variations I have no issues with so far). I want to be able to run an expression and return the numerical amount IF available (in the first string it is 165) and return nothing if it is not available AND make sure not to confuse with a date (second string). If I write the code as following, it returns the 165 but it also returns 04 from the second.
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]?, string)
If I write it as following, it includes neither
amount_search = re.findall(r'amount.*?(\d+)[^\d?/], string)
How to change what I have to return 165 but not 04?
To capture the whole number in a group, you could match amount followed by matching all chars except digits or newlines if the value can not cross newline boundaries.
Capture the first encountered digits in a group and assert a whitespace boundary at the right.
\bamount [^\d\r\n]*(\d+)(?!\S)
In parts
\bamount Match amount followed by a space and preceded with a word boundary
[^\d\r\n]* Match 0 or more times any char except a digit or newlines
(\d+) Capture group 1, match 1 or more digits
(?!\S) Assert a whitespace boundary on the right
Regex demo
try this ^amount\W*\$([\d]{1,})$
the $ indicate end of line, for what I have tested, use .* or ? also work.
by grouping the digits, you can eliminate the / inside the date format.
hope this helps :)
Try this:
from re import sub
your_digit_list = [int(sub(r'[^0-9]', '', s)) for s in str.split() if s.lstrip('$').isdigit()]
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
I explain a little bit based on #eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
print re.split(r'(?<=\D),\s*|\s*,(?=\D)',d)
re.split(pattern, string) will split string by the occurrences of regex pattern.
(plz read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two part: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can be appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbebind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
I have a text file including thousands of lines. here's an example
line = .Falies/367. 11DG1550/11DG15537.Axiom=nt60
line = .Failies/367. 11DG1550/11DG15537.Axiom=nt50
I tried to extract the string at the end 'nt60', 'nt50'.
lines = line.split('=')
version = lines[-1]
the problem is that the end of line character will be included ('\n')
I thought of using regular expression search to match the string starting from ('=nt')
but I have no idea what shall I use to match a =, word, number.
Can anyone help?
Your first approach is absolutely fine. You can just use the string that you have extracted using your first method and then apply strip() to it:
strip() removes all leading and trailing whitespaces and newlines from a string.
>>> your_str = 'nt60\n'
>>> your_str.strip()
'nt60'
For your case:
lines = line.rsplit('=',1)
version = lines[-1].strip()
The regex to match a = nt then a number is:
=(nt\d+)
And in your example:
line = .Falies/367. 11DG1550/11DG15537.Axiom=nt60
line = .Failies/367. 11DG1550/11DG15537.Axiom=nt50
it will return two matches:
MATCH 1
1. [49-53] `nt60`
MATCH 2
1. [105-109] `nt50`
Explanation:
`=` matches the character `=` literally
1st Capturing group `(nt\d+)`
`nt` matches the characters `nt` literally (case sensitive)
`\d` match a digit `[0-9]`
`+` Quantifier: Between one and unlimited times, as many times as possible,
giving back as needed
if you want your regex to match a = word number then just replace the nt with \w+ to match any word.
hope this helps.
I have a pattern which is looking for word1 followed by word2 followed by word3 with any number of characters in between.
My file however contains many random newline and other white space characters - which means that between word 1 and 2 or word 2 and 3 there could be 0 or more words and/or 0 or more newlines randomly
Why isn't this code working? (Its not matching anything)
strings = re.findall('word1[.\s]*word2[.\s]*word3', f.read())
[.\s]* - What I mean by this - find either '.'(any char) or '\s'(newline char) multiple times(*)
The reason why your reg ex is not working is because reg ex-es only try to match on a single line. They stop when they find a new line character (\n) and try to match the pattern on the new line starting from the beginning of the pattern.
In order to make the reg ex ignore the newline character you must add re.DOTALL as a third parameter to the findall function:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
You have two problems:
1) . doesn't mean anything special inside brackets [].
Change your [] to use () instead, like this: (.|\s)
2) \ doesn't mean what you think it does inside regular strings.
Try using raw strings:
re.findall(r'word1 ..blah..')
Notice the r prefix of the string.
Putting them together:
strings = re.findall(r'word1(.|\s)*word2(.|\s)*word3', f.read())
However, do note that this changes the returned list.