This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I am learning regex in Python. At one point I wrote the first regex below, while my tutorial uses the second. Both produce the same result for the given string. What is the difference between them? For what kind of string would the two patterns produce different results?
>>> f = 'From m.rubayet94#gmail.com sat Jan'
>>> y = re.findall('^From .*#(\S+)',f); print(y)
['gmail.com']
>>> y = re.findall('^From .*#([^ ]*)',f); print(y)
['gmail.com']
[^ ]* means zero or more non-space characters.
\S+ means one or more non-whitespace characters.
It looks like you're aiming to match a domain name that may be part of an email address, so \S+ is the better choice of the two: domain names can't contain any whitespace, including tabs (\t) and newlines (\n), not just spaces. (There are other characters domain names can't contain either, but that's beside the point.)
Here are some examples of the differences:
import re
p1 = re.compile(r'^From .*#([^ ]*)')
p2 = re.compile(r'^From .*#(\S+)')
for s in ['From eric#domain\nTo john#domain', 'From graham#']:
    print(p1.findall(s), p2.findall(s))
In the first case, whitespace isn't handled properly: ['domain\nTo'] ['domain']
In the second case, you get a null match where you shouldn't: [''] []
One of the regexes uses [^ ] while the other uses \S. I assume that at that point you're trying to match anything but whitespace.
The difference between the two is that \S matches any character that is not a whitespace character (in ASCII terms, the whitespace characters are [ \t\n\r\f\v]), whereas [^ ] matches any character that is not a literal space (i.e. the character produced by pressing the spacebar), so it still matches tabs and newlines.
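To see the two character classes diverge on input that contains a tab, here is a small sketch. The sample string and the names space_only and any_ws are made up for illustration; only the two patterns come from the question.

```python
import re

# The two patterns from the question; the variable names are illustrative.
space_only = re.compile(r'^From .*#([^ ]*)')   # stops only at a literal space
any_ws = re.compile(r'^From .*#(\S+)')         # stops at any whitespace

# Hypothetical input with a tab (not a space) right after the domain.
line = 'From alice#example.org\tsat Jan'

space_result = space_only.findall(line)
ws_result = any_ws.findall(line)

print(space_result)  # ['example.org\tsat']
print(ws_result)     # ['example.org']
```

The tab is not a space, so [^ ]* keeps going until it reaches the space before Jan, while \S+ stops at the tab.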
Can someone please help me with this? I'm trying to extract the word in a given sentence that consists of a number followed by one of the units G, GM, KG, L, ML or PCS.
I am able to match the string, but I'm not sure how to extract the complete word.
For example, if my input is "This packet contains 250G Dates", the output should be 250G.
Another example: for "You paid for 2KG Apples" the output should be 2KG.
With my regular expression I get only the matched string, not the complete word :(
import re
val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G','GM','KG','L','ML','PCS']
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
This regex will not get you what you want:
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
Let's break it down:
\d+: one or more digits
\.?: a dot (optional, as indicated by the question mark)
\d*: zero or more digits
(\s|G|KG|GM|L|ML|PCS): a capturing group of alternatives. Whitespace is listed as one alternative among the units, but it belongs outside the group: what you probably want is to allow optional whitespace between the number and the unit, i.e. 240G or 240 G
\s?: optional whitespace
A better expression for your purpose could be:
re.findall("\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
That means: one or more digits, followed by optional whitespace and then either of these units: G|KG|GM|L|ML|PCS.
Note the presence of ?: to indicate a non-capturing group. Without it, findall would return only the captured unit, i.e. ['G'] instead of ['240G'].
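The effect of the non-capturing group can be checked directly. The sketch below also shows a side effect of the order of the alternatives, and ends with a variation that is my own assumption rather than part of the answer: longer units listed first, an optional decimal part, and a trailing \b. The name unit_re and the extra sample sentences are illustrative only.

```python
import re

val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"\d+\s*(G|KG|GM|L|ML|PCS)", val)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", val)

# Alternatives are tried left to right, so G wins over GM here.
short = re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", "500GM rice")

# Hypothetical variation: longer units first, optional decimal part,
# and a word boundary so the unit has to end the word.
unit_re = re.compile(r"\d+(?:\.\d+)?\s*(?:PCS|KG|GM|ML|G|L)\b")
strict = unit_re.findall("You paid for 2KG Apples, 1.5 L milk and 500GM rice")

print(captured)  # ['G']
print(whole)     # ['240G']
print(short)     # ['500G']
print(strict)    # ['2KG', '1.5 L', '500GM']
```

Whether the stricter variation is appropriate depends on the real data, for example on whether decimals such as 1.5 actually occur.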
Try using this Regex:
\d+\s*(G|KG|GM|L|ML|PCS)\s?
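It matches any substring that starts with at least one digit and is followed by one of the units. Whitespace is allowed between the digits and the unit, and one whitespace character may follow the unit. Note that with re.findall the capturing group makes the call return only the unit; write the group as (?:...) to get the whole match.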
Adjust this like you want to :)
Use non-capturing parentheses (?:...) instead of the normal ones. When the pattern contains no capturing groups, findall returns the string(s) that match the whole pattern.
I am trying to find all substrings that follow a specific pattern in a Python string.
"\[\[Cats:.*\]\]"
However, if several occurrences of the pattern appear on one line, the regex treats them as a single match instead of matching each occurrence separately.
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
x = re.findall("\[\[Cats:.*\]\]", strng)
The output I get is:
['[[Cats: Text1]] said I am in the [[Cats: Text2]]']
instead of
['[[Cats: Text1]]', '[[Cats: Text2]]']
which is a list of the individual matches.
What regex do I use?
"\[\[Cats:.*?\]\]"
Your current regex is greedy: it gobbles up EVERYTHING from the first pair of opening brackets to the last pair of closing brackets. Making it non-greedy should return all of your results.
The problem is that you are doing a greedy search. Add a ? after .* to make the match non-greedy.
Code follows:
import re
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
regex_template = re.compile(r'\[\[Cats:.*?\]\]')
matches = re.findall(regex_template, strng)
print(matches)
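Avoid .* here: it is greedy, so it runs past the first ]] and stops only at the last one. .* means any character (except a newline), zero or more times. A negated character class such as [^\]]+ stops at the first ] instead: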
import re
strng = '''[[Cats: lol, this is 100 % cringe]]
said I am in the [[Cats: lol, this is 100 % cringe]]
fhg is abnorn'''
x = re.findall(r"\[\[Cats: [^\]]+\]\]", strng)
print(x)
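The three variants discussed in these answers can be compared side by side. This is only a sketch: the variable names and the second input string (with a single ] inside the text) are made up to show where the lazy quantifier and the negated class differ.

```python
import re

strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'

greedy = re.findall(r'\[\[Cats:.*\]\]', strng)         # runs to the last ]]
lazy = re.findall(r'\[\[Cats:.*?\]\]', strng)          # stops at the first ]]
negated = re.findall(r'\[\[Cats: [^\]]+\]\]', strng)   # text may not contain ]

print(greedy)   # ['[[Cats: Text1]] said I am in the [[Cats: Text2]]']
print(lazy)     # ['[[Cats: Text1]]', '[[Cats: Text2]]']
print(negated)  # ['[[Cats: Text1]]', '[[Cats: Text2]]']

# Hypothetical input with a lone ] inside the text.
tricky = '[[Cats: a]b]]'
lazy_tricky = re.findall(r'\[\[Cats:.*?\]\]', tricky)
negated_tricky = re.findall(r'\[\[Cats: [^\]]+\]\]', tricky)

print(lazy_tricky)     # ['[[Cats: a]b]]']
print(negated_tricky)  # []
```

Which behaviour is preferable depends on whether a single ] may legitimately appear inside the brackets.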
Suppose a string:
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
I want to replace F3·Compute· with '' using a regex:
In [23]: re.sub(r'F3?Compute?', '',s)
Out[23]: 'F3·Compute·Introduction to Methematical Thinking.pdf'
It did not work as I intended.
When I tried the literal text as the pattern, it worked:
In [21]: re.sub(r'F3·Compute·', '', 'F3·Compute·Introduction to Methematical Thinking.pdf')
Out[21]: 'Introduction to Methematical Thinking.pdf'
What's the problem with my regex pattern?
The question mark ? does not stand in for a single character in regular expressions. It means 0 or 1 of the previous character, which in your case was 3 and e. Instead, the . is what you're looking for. It is a wildcard that stands for a single character (and has nothing to do with your middle-dot character; that is just coincidence).
re.sub(r'F3.Compute.', '',s)
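A quick way to confirm the difference, sketched for Python 3 (where str is already Unicode). The variable names are illustrative, and the re.escape variant is an additional option that is not part of the answer above.

```python
import re

s = 'F3·Compute·Introduction to Methematical Thinking.pdf'

# ? makes the preceding token optional; it never matches the · itself.
with_question_marks = re.sub(r'F3?Compute?', '', s)

# . matches any single character (except a newline), including ·.
with_dots = re.sub(r'F3.Compute.', '', s)

# re.escape builds a pattern that matches the prefix literally.
with_literal = re.sub(re.escape('F3·Compute·'), '', s)

print(with_question_marks == s)  # True: nothing was replaced
print(with_dots)                 # Introduction to Methematical Thinking.pdf
print(with_literal)              # Introduction to Methematical Thinking.pdf
```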
Use dot to match any single character (the code below is Python 2):
#coding: utf-8
import re
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
output = re.sub(r'F3.Compute.', '', unicode(s,"utf-8"), flags=re.U)
print output
Your original pattern, 'F3?Compute?', did not have the desired effect. It says to match F optionally followed by the digit 3, and it also makes the final e of Compute optional. In either case, it does not account for the separator characters.
Note also that in Python 2 we must match against the unicode version of the string rather than the byte string. Otherwise a single dot won't match the separator you are targeting, because · is encoded as two bytes in UTF-8. (In Python 3, str is already Unicode, so no conversion is needed.)
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could split any string on a ',' that has two non-numeric characters on either side, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
I'll explain a little more, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The pattern (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, i.e. a comma with a non-digit on at least one side, which leaves the commas between digits (as in 2,4,5-TP) alone.
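Why the assertions matter becomes clearer when the same split is attempted with a pattern that consumes the neighbouring character. The following is a sketch; the consuming pattern \D,\s* is a deliberately naive counter-example of my own, not something proposed in the answers.

```python
import re

# The lookahead example from the text: only the q followed by u matches,
# and the u itself is not part of the match.
q_matches = re.findall(r'q(?=u)', 'quit qat')
print(q_matches)  # ['q']

entry = 'Lead,Paints/Pigments,Zinc'

# Consuming the neighbouring character removes it from the pieces.
consuming = re.split(r'\D,\s*', entry)

# A lookbehind/lookahead only checks the neighbour, so the pieces stay intact.
lookaround = re.split(r'(?<=\D),\s*|\s*,(?=\D)', entry)

print(consuming)   # ['Lea', 'Paints/Pigment', 'Zinc']
print(lookaround)  # ['Lead', 'Paints/Pigments', 'Zinc']
```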
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use a regex with lookbehind/lookahead assertions:
>>> import re
>>> s = '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP'
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
However, some entries have been shortened to 01-234-5 instead of 01-234-0005 when the last four digits are three zeros followed by a non-zero digit.
Since there doesn't seem to be any uniformity in formatting, I had to account for a few different separator characters, or possibly none at all. Luckily, I have only noticed this shortening when a separator is used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma; they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
#    ^ exactly nine digits
#         ^ or
#                            ^ sep not optional
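As a rough check of that pattern, the sketch below runs it over a few made-up inputs. The helper name match_number and the sample strings are assumptions for illustration only.

```python
import re

pattern = re.compile(
    r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
)

def match_number(text):
    """Return the first matching number in text, or None (illustrative helper)."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(match_number('01-234-5678'))  # 01-234-5678 (full form)
print(match_number('01-234-5'))     # 01-234-5    (shortened form, separator present)
print(match_number('012345678'))    # 012345678   (nine digits, no separator)
print(match_number('012345'))       # None        (shortened form without separator)
print(match_number('01-234:5678'))  # None        (separators must agree)
```

The backreference (?P=sep) is what rejects the last input: the second separator has to repeat the first one.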
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire thing to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.
A string with no separator, e.g. "012340008", will also match the regex above, because [-:\s]? matches 0 or 1 times.
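The caveat about the optional separator can be seen with a few made-up inputs. In this sketch the names short and results are illustrative, and fullmatch is used so that only whole strings count.

```python
import re

short = re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')

samples = ['01-234-5678', '01-234-5', '012340008', '012345', '01-234:5678']

# True where the whole string matches the shortened pattern.
results = {s: bool(short.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(s, ok)
# 01-234-5678 True
# 01-234-5 True
# 012340008 True
# 012345 True   (six bare digits are accepted too)
# 01-234:5678 False
```

If six bare digits should not be accepted, the separator has to be made mandatory for the shortened form.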
HTH