Regular expression extracting number dimension - python

I'm using python regular expressions to extract dimensional information from a database. The entries in that column look like this:
23 cm
43 1/2 cm
20cm
15 cm x 30 cm
What I need from this is only the width of the entry (so for the entries with an 'x', only the first number), but as you can see the values are all over the place.
From what I understood in the documentation, you can access the groups in a match using their position, so I was thinking I could determine the type of the entry based on how many groups are returned and what is found at each index.
The expression I used so far is ^(\d{2})\s?(x\s?(\d{2}))?(\d+/\d+)?$, however it's not perfect and it returns a number of useless groups. Is there something more efficient and appropriate?
Edit: I need the number from every line. When there is only one number, it is implied that only the width was measured (including any fractional components such as line 2). When there are two numbers, the height was also measured, but I only need the width which is the first number (such as in the last line)

try regex below, it will capture 1st digits and optional fractional come after it before the 1st 'cm'
import re
regex = re.compile('(\d+.*?)\s?cm') # this will works for all your example data
# or
# this asserted whatever come after the 1st digit group must be fractional number only
regex = re.compile('(\d+(?:\s+\d+\/\d+)?)\s?cm')
>>> regex.match('23 cm').group(1)
>>> '23'
>>> regex.match('43 1/2 cm').group(1)
>>> '43 1/2'
>>> regex.match('20cm').group(1)
>>> '20'
>>> regex.match('15 cm x 30 cm').group(1)
>>> '15'
regex101 demo

This regex should work (Live Demo)
^(\d+)(?:\s*cm\s+[xX])
Explanation
^(\d+) - capture at least one digit at the beginning of the line
(?: - start non-capturing group
\s* - followed by at least zero whitespace characters
cm - followed by a literal c and m
\s+ - followed by at least one whitespace character
[xX] - followed by a literal x or X
) - end non-capturing group
You shouldn't need to bother matching the rest of the line.

Here's a sample of how to do it from a text file.
It works for the provided data.
f = open("textfile.txt",r')
for line in f :
if 'x'in line:
iposition = line.find('x')
print(line[:iposition])

Related

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0mwith g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
033[33m\033[1m
Group1: 33
Group2: 1
Match 2
033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consectuively repeating \033[\d+m chunks of text and join the numbers after [ with a semi-colon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern will match two or more sequences of \033[ + one or more digits + m chunks of texts and then, the match will be passed to the lambda expression, where the output will be: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()) and deduplicated with the set, and then 3) m.
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r" ^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
Demo
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), the backslash and zero and required and the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you desribe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m

Use regex to identify 4 to 5 numbers that are (consecutive, i.e no whitespace or special characters included), without including preceding 0's

I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below is working effectively in all cases unless there are consecutive 0's preceding a one, two or 3 digit number. I don't want '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000' to all be matches. Is there a good way to implement this using regular expressions? Here is my current code that works for most cases except when there are preceding 0's to a series of digits less than 4 or 5 characters in length.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
print(0)
else:
print(int(match[0]))
You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
See the regex demo.
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
^ ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Just because you can simple use a capturing group, you may use an equivalent regex:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']

Match a date with the same character separating the values

I have to find dates in multiple formats in a text.
I have some regex like this one:
# Detection of:
# 25/02/2014 or 25/02/14 or 25.02.14
regex = r'\b(0?[1-9]|[12]\d|3[01])[-/\._](0?[1-9]|1[012])[-/\._]((?:19|20)\d\d|\d\d)\b'
The problem is that it also matches dates like 25.02/14 which is not good because the splitting character is not the same.
I could of course do multiple regex with a different splitting character for every regex, or do a post-treatment on the matching results, but I would prefer a complete solution using only one good regex. Is there a way to do so?
In addition to my comment (the original word boundary approach lets the pattern match "dates" that are in fact parts of other entities, like IPs, serial numbers, product IDs, etc.), see the improved version of your regex in comparison with yours:
import re
s = '25.02.19.35 6666-20-03-16-67875 25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = [m.group() for m in re.finditer(r'\b(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d\b', s)]
print(found_dates) # initial regex
found_dates = [m.group() for m in re.finditer(r'(?<![\d.-])(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d(?!\1\d)', s)]
print(found_dates) # fixed boundaries
# = >['25.02.19', '20-03-16', '11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
# => ['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
See, your regex extracts '25.02.19' (part of a potential IP) and '20-03-16' (part of a potential serial number/product ID).
Note I also shortened the regex and extraction code a bit.
Pattern details:
(?<![\d.-]) - a negative lookbehind making sure there is no digit, .
and - immediately to the left of the current location (/ has been discarded since dates are often found inside URLs)
(?:0?[1-9]|[12]\d|3[01]) - 01 / 1 to 31 (day part)
([./-]) - Group 1 (technical group to hold the separator value) matching either ., or / or -
(?:0?[1-9]|1[012]) - month part: 01 / 1 to 12
\1 - backreference to the Group 1 value to make sure the same separator comes here
(?:19|20)?\d\d - year part: 19 or 20 (optional values) and then any two digits.
(?!\1\d) - negative lookahead making sure there is no separator (captured into Group 1) followed with any digit immediately to the right of the current location.
Based on the comment of Rawing, this did the trick:
regex = r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b'
So, the complete code is:
import re
s = '25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = []
for m in re.finditer(r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b', s):
found_dates.append(m.group(0))
print(found_dates)
The output is, as desired :
['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']

regex for roman numberal not working

I am trying to identify roman numberals from text with the following regex:
>>>Title="LXXXIV XC, XCII XXX LXII"
>>>RomanNum = re.findall(r'[\s,]+M{0,4}[CM|CD|D?C{0,3}]?[XC|XL|L?X{0,3}]?[IX|IV|V?I{0,3}]?[\s,]+', Title, re.M|re.I)`
>>>RomanNum
[' \t']
I want something like:
['LXXXIV', 'XC, 'XCII', 'XXX', 'LXII']
As far as my understanding of regular expression is concerned I think at least XC should have been matched. XC should match [XC|XL|L?X{0,3}] part of regular expression above with whitespace before and a comma after it which is captured by the above regex. What am I missing?
Apart from that I can achieve the desired result as following(but greater complexity which I want to avoid):
>>>RomanNum = [re.search(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$', TitleElem, re.M|re.I) for TitleElem in re.split(',| ', Title)]`
Any help appreciated.
Your regex syntax is off at this point:
XC should match [XC|XL|L?X{0,3}]
because you use square brackets, where you describe the behavior of round parentheses. Change the square brackets to round ones to correct.
This error is repeated in other parts of your full regex.
If you want to find several roman numbers in a string with the findall or finditer method, one possible pattern is:
(?=[MDCXLVI])(?<![MDCXLVI])M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})(?![MDCXLVI])
It's a bit long and I will explain why I think it is efficient:
(?=[MDCXLVI]) is a lookahead that checks if the position is followed by one of these characters. This lookahead has two functions:
The first is to emulate a kind of first-character discrimination to quickly avoid all positions that don't contain one of these characters (In this way, the regex engine don't need to test all possible beginings with M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})).
The second checks if there is at least one character, since M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}) can match an empty string.
(?<![MDCXVLI]) and (?![MDCXVLI]) are used as boundaries to ensure there are no other "roman characters" around (otherwise a substring like ILVIII will return LVIII as result instead of skipping the entire group of characters with a wrong format). Note that other kind of boundaries are possible, like \b or (?<![^\s,]) (?![^\s,]) ... depending of the string format. Note too, that the left boundary is placed only after (?=[MDCXVLI]) to not break the first-character discrimination.
Alternations like CM|CD are reduced to C[MD].
The pattern use only non-capturing groups (?:...) to preserve memory and avoid uneeded storage tasks.
Dive Into Python provides a nice regex for detecting Roman Numerals. They also provide a sample script that you can utilize to start. This script comes from section 7.5 of my first link.
#Define pattern to detect valid Roman numerals
romanNumeralPattern = re.compile("""
^ # beginning of string
M{0,4} # thousands - 0 to 4 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
# or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
# or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
""" ,re.VERBOSE)

How to add additional criteria to re.findall ... Python 2.7?

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence) #thanks to #Martin Pieters and #nneonneo
I have a line of code that finds any instance of A|G followed by 2 characters and then ATG that is then followed by either a TAA|TAG|TGA when read in units of 3. only works when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or greater
i want to add a criteria
i need the ATG to be followed by a G
so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG #at least 30 elements long
example:
GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work
GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because it is an (A|G) followed by only one value (not 2) before the ATG and there is not a G following the A|G-xx-ATG
i hope this makes sense
I tried
ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)
but it seemed like it was using window size 3 after last G of ATGG
basically I need that code, where the first occurrence is A|G-xx-ATG and the second occurrence is (G-xx)
It'll be easier if you use a character group of [AG], there is no need to group the two 'free' characters:
ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
or you need to group the A|G:
ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Applying the first form to your examples:
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]
In your attempt, the expression matches either an A, or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA) because the | symbol applies to everything that preceeds or follows it within the same group. As it is not grouped, it applies to the whole expression instead:
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
If you need to match a certain amount of characters in your whole match, you need to tailor those 3 character (?:...) groups to match a minimum number of times:
ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
would match A or G followed by 2 characters, followed by ATGG with another 2 characters, then at least 7 times 3 characters (total 21), followed by a specific pattern of 3 more (TAA, TAG or TGA) for a total of at least 33 characters from the first to the last character. The extra .. make up the pattern of 3 after ATG and matches your example from your comment:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']
as well as correctly handling the examples given in your question:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
To ensure you get at least 30 characters, use the {n,} quantifier:
r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'
This ensures that you read at least 9 triplets (27 characters) between the ATG opening and the TAA|TGA|TAG terminator.

Categories